
3 June 2017

Mike Hommey: Announcing git-cinnabar 0.5.0 beta 1

Git-cinnabar is a git remote helper to interact with mercurial repositories. It allows you to clone, pull and push from/to mercurial remote repositories, using git. Get it on GitHub. These release notes are also available on the git-cinnabar wiki. What's new since 0.4.0?

31 May 2017

Chris Lamb: Free software activities in May 2017

Here is my monthly update covering what I have been doing in the free software world (previous month):
Reproducible builds

Whilst anyone can inspect the source code of free software for malicious flaws, most software is distributed pre-compiled to end users. The motivation behind the Reproducible Builds effort is to permit verification that no flaws have been introduced either maliciously or accidentally during this compilation process by promising identical results are always generated from a given source, thus allowing multiple third-parties to come to a consensus on whether a build was compromised. (I have generously been awarded a grant from the Core Infrastructure Initiative to fund my work in this area.) This month I:
I also made the following changes to our tooling:
diffoscope

diffoscope is our in-depth and content-aware diff utility that can locate and diagnose reproducibility issues; a sample invocation is sketched after the change below.

  • Don't fail when run under perversely-recursive input files. (#780761).
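
For illustration only (the file names below are hypothetical, not taken from this report), a typical invocation compares two builds of a package and writes an HTML report:
$ diffoscope --html report.html package_1.0-1_amd64.deb package_1.0-1.rebuild_amd64.deb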

strip-nondeterminism

strip-nondeterminism is our tool to remove specific non-deterministic results from a completed build; a sample invocation is sketched after the changes below.

  • Move from verbose_print to nonquiet_print so we print when normalising a file. This is so we can start to target the removal of strip-nondeterminism itself.
  • Only print log messages by default if the file was actually modified. (#863033)
  • Update package long descriptions to clarify that the tool itself is a temporary workaround. (#862029)
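
For illustration only (the file name and timestamp are hypothetical), a typical invocation normalises a single build artifact in place, clamping embedded timestamps to a fixed value:
$ strip-nondeterminism --timestamp 1493596800 libfoo.jar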


Debian

My activities as the current Debian Project Leader are covered in my "Bits from the DPL" email to the debian-devel-announce list. However, I:
  • Represented Debian at OSCAL 2017 in Tirana, Albania.
  • Attended the Reproducible Builds hackathon in Hamburg, Germany. (Report)
  • Finally, I attended Debian SunCamp 2017 in Lloret de Mar in Catalonia, Spain.

Patches contributed
  • xarchiver: Adding files to .tar.xz deletes existing content. (#862593)
  • screen-message: Please invert the default colours. (#862056)
  • fontconfig: fc-cache returns with exit code 0 on 256 errors. (#863427)
  • quadrapassel: Segfaults when unpausing a paused finished game. (#863106)
  • camping: Broken symlink. (#861040)
  • dns-root-data: Does not build if /bin/sh is Bash. (#862252)
  • dh-python: bit.ly link doesn't work anymore. (#863074)

Debian LTS

This month I have been paid to work 18 hours on Debian Long Term Support (LTS). In that time I did the following:
  • "Frontdesk" duties, triaging CVEs, adding links to upstream patches, etc.
  • Issued DLA 930-1 fixing a remote application crash vulnerability in libxstream-java, a Java library to serialize objects to XML and back again.
  • Issued DLA 935-1 correcting a local denial of service vulnerability in lxterminal, the terminal emulator for the LXDE desktop environment.
  • Issued DLA 940-1 to remedy an issue in sane-backends which allowed remote attackers to obtain sensitive memory information via a crafted SANE_NET_CONTROL_OPTION packet.
  • Issued DLA 943-1 for the deluge bittorrent client to fix a directory traversal attack vulnerability in the web user interface.
  • Issued DLA 949-1 fixing an integer signedness error in the miniupnpc UPnP client that could allow remote attackers to cause a denial of service attack.
  • Issued DLA 959-1 for the libical calendaring library. A use-after-free vulnerability could allow remote attackers to cause a denial of service and possibly read heap memory via a specially crafted .ICS file.

Uploads
  • redis (3:3.2.9-1) New upstream release.
  • python-django:
    • 1:1.11.1-1 New upstream minor release.
    • 1:1.11.1-2 & 1:1.11.1-3 Add missing Build-Depends on libgdal-dev due to new GIS tests.
  • docbook-to-man:
    • 1:2.0.0-36 Adopt package. Apply a patch to prevent undefined behaviour caused by a memcpy(3) parameter overlap. (#842635, #858389)
    • 1:2.0.0-37 Install manpages using debian/docbook-to-man.manpages over manual calls.
  • installation-birthday Initial upload and misc. subsequent fixes.
  • bfs:
    • 1.0-3 Fix FTBFS on hurd-i386. (#861569)
    • 1.0.1-1 New upstream release & correct debian/watch file.

I also made the following non-maintainer uploads (NMUs):
  • ca-certificates (20161130+nmu1) Remove StartCom and WoSign certificates as they are now untrusted by the major browser vendors. (#858539)
  • sane-backends (1.0.25-4.1) Correct missing error handler in (generated) prerm script. (#862334)
  • seqan2 (2.3.1+dfsg-3.1) Fix broken /usr/bin/splazers symlink on 32-bit architectures. (#863669)
  • jackeq (0.5.9-2.1) Fix a segmentation fault caused by passing a truncated pointer instead of a GtkType. (#863416)
  • kluppe (0.6.20-1.1) Fix segmentation fault at startup. (#863421)
  • coyim (0.3.7-2.1) Skip tests that require internet access to avoid FTBFS. (#863414)
  • pavuk (0.9.35-6.1) Fix segmentation fault when opening "Limitations" window. (#863492)
  • porg (2:0.10-1.1) Fix broken LD_PRELOAD path. (#863495)
  • timemachine (0.3.3-2.1) Fix two segmentation faults caused by truncated pointers. (#863420)

Debian bugs filed
  • acct: Docs incorrectly installed to "accounting.html" directory. (#862180)
  • git-hub: Does not work with 2FA-enabled accounts. (#863265)
  • libwibble: Homepage and Vcs-Darcs fields are outdated. (#861673)



I additionally filed 2 bugs against flower and r-bioc-gviz for accessing the internet during the build.


I also filed 6 FTBFS bugs against cronutils, isoquery, libgnupg-interface-perl, maven-plugin-tools, node-dateformat, password-store & simple-tpm-pk11.

FTP Team

As a Debian FTP assistant I ACCEPTed 105 packages: boinc-app-eah-brp, debug-me, e-mem, etcd, fdroidcl, firejail, gcc-6-cross-ports, gcc-7-cross-ports, gcc-defaults, gl2ps, gnome-software, gnupg2, golang-github-dlclark-regexp2, golang-github-dop251-goja, golang-github-nebulouslabs-fastrand, golang-github-pkg-profile, haskell-call-stack, haskell-foundation, haskell-nanospec, haskell-parallel-tree-search, haskell-posix-pty, haskell-protobuf, htmlmin, iannix, libarchive-cpio-perl, libexternalsortinginjava-java, libgetdata, libpll, libtgvoip, mariadb-10.3, maven-resolver, mysql-transitional, network-manager, node-async-each, node-aws-sign2, node-bcrypt-pbkdf, node-browserify-rsa, node-builtin-status-codes, node-caseless, node-chokidar, node-concat-with-sourcemaps, node-console-control-strings, node-create-ecdh, node-create-hash, node-create-hmac, node-cryptiles, node-dot, node-ecc-jsbn, node-elliptic, node-evp-bytestokey, node-extsprintf, node-getpass, node-gulp-coffee, node-har-schema, node-har-validator, node-hawk, node-jsprim, node-memory-fs, node-pbkdf2, node-performance-now, node-set-immediate-shim, node-sinon-chai, node-source-list-map, node-stream-array, node-string-decoder, node-stringstream, node-verror, node-vinyl-sourcemaps-apply, node-vm-browserify, node-webpack-sources, node-wide-align, odil, onionshare, opensvc, otb, perl, petsc4py, pglogical, postgresql-10, psortb, purl, pymodbus, pymssql, python-decouple, python-django-rules, python-glob2, python-ncclient, python-parse-type, python-prctl, python-sparse, quoin-clojure, quorum, r-bioc-genomeinfodbdata, radlib, reprounzip, rustc, sbt-test-interface, slepc4py, slick-greeter, sparse, te923con, trabucco, traildb, typescript-types & writegood-mode. I additionally filed 6 RC bugs against packages that had incomplete debian/copyright files against: libgetdata, odil, opensvc, python-ncclient, radlib and reprounzip.

3 May 2017

Vincent Bernat: VXLAN: BGP EVPN with Cumulus Quagga

VXLAN is an overlay network to encapsulate Ethernet traffic over an existing (highly available and scalable, possibly the Internet) IP network while accommodating a very large number of tenants. It is defined in RFC 7348. For an introduction to its use with Linux, have a look at my VXLAN & Linux post.

VXLAN deployment

In the above example, we have hypervisors hosting virtual machines from different tenants. Each virtual machine is given access to a tenant-specific virtual Ethernet segment. Users expect classic Ethernet segments: no MAC restrictions [1], total control over the IP addressing scheme they use and availability of multicast. In a large VXLAN deployment, two aspects need attention:
  1. discovery of other endpoints (VTEPs) sharing the same VXLAN segments, and
  2. avoidance of BUM frames (broadcast, unknown unicast and multicast) as they have to be forwarded to all VTEPs.
A typical solution to the first point is multicast; a typical solution to the second is source-address learning.
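
For comparison, here is roughly what a plain multicast-based VXLAN endpoint looks like on Linux, with no control plane at all (the multicast group and interface name are illustrative):
$ ip link add vxlan100 type vxlan \
>     id 100 \
>     group 239.1.1.1 \
>     dev eth0 \
>     dstport 4789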

Introduction to BGP EVPN

BGP EVPN (RFC 7432 and draft-ietf-bess-evpn-overlay for its application to VXLAN) is a standard control protocol that efficiently solves those two aspects without relying on multicast or source-address learning. BGP EVPN relies on BGP (RFC 4271) and its MP-BGP extensions (RFC 4760). BGP is the routing protocol powering the Internet. It is highly scalable and interoperable. It is also extensible, and one of its extensions is MP-BGP. This extension can carry reachability information (NLRI) for multiple protocols (IPv4, IPv6, L3VPN and, in our case, EVPN). EVPN is a special family to advertise MAC addresses and the remote equipment they are attached to. There are basically two kinds of reachability information a VTEP sends through BGP EVPN:
  1. the VNIs they have interest in (type 3 routes), and
  2. for each VNI, the local MAC addresses (type 2 routes).
The protocol also covers other aspects of virtual Ethernet segments (L3 reachability information from ARP/ND caches, MAC mobility and multi-homing [2]) but we won't describe them here. To deploy BGP EVPN, a typical solution is to use several route reflectors (both for redundancy and scalability), like in the picture below. Each VTEP opens a BGP session to at least two route reflectors, sends its information (MACs and VNIs) and receives others'. This reduces the number of BGP sessions to configure.

VXLAN deployment with route reflectors

Compared to other solutions to deploy VXLAN, BGP EVPN has three main advantages:
  • interoperability with other vendors (notably Juniper and Cisco),
  • proven scalability (a typical BGP router handles several million routes), and
  • possibility to enforce fine-grained policies.
On Linux, Cumulus Quagga is a fairly complete implementation of BGP EVPN (type 3 routes for VTEP discovery, type 2 routes with MAC or IP addresses, MAC mobility when a host moves from one VTEP to another) which requires very little configuration. It is a fork of Quagga currently used in Cumulus Linux, a network operating system based on Debian powering switches from various brands. At some point, BGP EVPN support will be contributed back to FRR, a community-maintained fork of Quagga [3]. It should be noted that the BGP EVPN implementation of Cumulus Quagga currently only supports IPv4.

Route reflector setup

Before configuring each VTEP, we need to configure two or more route reflectors. There are many solutions. I will present three of them:
  • using Cumulus Quagga,
  • using GoBGP, an implementation of BGP in Go,
  • using Juniper JunOS.
For reliability purposes, it's possible (and easy) to use one implementation for some route reflectors and another implementation for the others. The proposed configurations are quite minimal. However, it is possible to centralize policies on the route reflectors (e.g. routes tagged with some community can only be readvertised to some group of VTEPs).

Using Quagga

The configuration is pretty simple. We assume the route reflector has 203.0.113.254 configured as a loopback IP.
router bgp 65000
  bgp router-id 203.0.113.254
  bgp cluster-id 203.0.113.254
  bgp log-neighbor-changes
  no bgp default ipv4-unicast
  neighbor fabric peer-group
  neighbor fabric remote-as 65000
  neighbor fabric capability extended-nexthop
  neighbor fabric update-source 203.0.113.254
  bgp listen range 203.0.113.0/24 peer-group fabric
  !
  address-family evpn
   neighbor fabric activate
   neighbor fabric route-reflector-client
  exit-address-family
  !
  exit
!
A peer group fabric is defined and we leverage the dynamic neighbor feature of Cumulus Quagga: we don't have to explicitly define each neighbor. Any client from 203.0.113.0/24 presenting itself as part of AS 65000 can connect. All sent EVPN routes will be accepted and reflected to the other clients. You don't need to run Zebra, the route engine talking with the kernel. Instead, start bgpd with the --no_kernel flag.
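
For example, on a Debian-based system the daemon could be started along these lines (the binary and configuration paths are illustrative and depend on how Quagga was installed):
$ /usr/lib/quagga/bgpd --no_kernel \
>       --config_file /etc/quagga/bgpd.conf \
>       --daemon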

Using GoBGP

GoBGP is a clean implementation of BGP in Go [4]. It exposes an RPC API for configuration (but also accepts a configuration file and comes with a command-line client). It doesn't support dynamic neighbors, so you'll have to use the API, the command-line client or some templating language to automate their declaration. A configuration with only one neighbor looks like this:
global:
  config:
    as: 65000
    router-id: 203.0.113.254
    local-address-list:
      - 203.0.113.254
neighbors:
  - config:
      neighbor-address: 203.0.113.1
      peer-as: 65000
    afi-safis:
      - config:
          afi-safi-name: l2vpn-evpn
    route-reflector:
      config:
        route-reflector-client: true
        route-reflector-cluster-id: 203.0.113.254
More neighbors can be added from the command line:
$ gobgp neighbor add 203.0.113.2 as 65000 \
>         route-reflector-client 203.0.113.254 \
>         --address-family evpn
GoBGP won't try to interact with the kernel, which is fine for a route reflector.

Using Juniper JunOS

A variety of Juniper products can act as a BGP route reflector; the main factors are CPU and memory. The QFX5100 is low on memory and won't support large deployments without some additional policing. Here is a configuration similar to the Quagga one:
interfaces {
    lo0 {
        unit 0 {
            family inet {
                address 203.0.113.254/32;
            }
        }
    }
}
protocols {
    bgp {
        group fabric {
            family evpn {
                signaling {
                    /* Do not try to install EVPN routes */
                    no-install;
                }
            }
            type internal;
            cluster 203.0.113.254;
            local-address 203.0.113.254;
            allow 203.0.113.0/24;
        }
    }
}
routing-options {
    router-id 203.0.113.254;
    autonomous-system 65000;
}

VTEP setup

The next step is to configure each VTEP/hypervisor. Each VXLAN is locally configured using a bridge for local virtual interfaces, as illustrated in the schema below. The bridge takes care of the local MAC addresses (notably, using source-address learning) and the VXLAN interface takes care of the remote MAC addresses (received with BGP EVPN).

Bridged VXLAN device

VXLANs can be provisioned with the following script. Source-address learning is disabled as we will rely solely on BGP EVPN to synchronize FDBs between the hypervisors.
for vni in 100 200; do
    # Create VXLAN interface
    ip link add vxlan${vni} type vxlan \
        id ${vni} \
        dstport 4789 \
        local 203.0.113.2 \
        nolearning
    # Create companion bridge
    brctl addbr br${vni}
    brctl addif br${vni} vxlan${vni}
    brctl stp br${vni} off
    ip link set up dev br${vni}
    ip link set up dev vxlan${vni}
done
# Attach each VM to the appropriate segment
brctl addif br100 vnet10
brctl addif br100 vnet11
brctl addif br200 vnet12
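As a quick sanity check (purely illustrative, not part of the original setup), the created devices can be inspected; the VXLAN parameters (VNI, UDP port, local IP) appear in the detailed link output and the bridge memberships in the brctl output:
$ ip -d link show vxlan100
$ brctl show br100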
The configuration of Cumulus Quagga is similar to the one used for a route reflector, except we use the advertise-all-vni directive to publish all local VNIs.
router bgp 65000
  bgp router-id 203.0.113.2
  no bgp default ipv4-unicast
  neighbor fabric peer-group
  neighbor fabric remote-as 65000
  neighbor fabric capability extended-nexthop
  neighbor fabric update-source dummy0
  ! BGP sessions with route reflectors
  neighbor 203.0.113.253 peer-group fabric
  neighbor 203.0.113.254 peer-group fabric
  !
  address-family evpn
   neighbor fabric activate
   advertise-all-vni
  exit-address-family
  !
  exit
!
If everything works as expected, the instances sharing the same VNI should be able to ping each other. If IPv6 is enabled on the VMs, the ping command shows if everything is in order:
$ ping -c10 -w1 -t1 ff02::1%eth0
PING ff02::1%eth0(ff02::1%eth0) 56 data bytes
64 bytes from fe80::5254:33ff:fe00:8%eth0: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from fe80::5254:33ff:fe00:b%eth0: icmp_seq=1 ttl=64 time=4.98 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:9%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
64 bytes from fe80::5254:33ff:fe00:a%eth0: icmp_seq=1 ttl=64 time=4.99 ms (DUP!)
--- ff02::1%eth0 ping statistics ---
1 packets transmitted, 1 received, +3 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.016/3.745/4.991/2.152 ms

Verification

Step by step, let's check how everything comes together.

Getting VXLAN information from the kernel

On each VTEP, Quagga should be able to retrieve the information about configured VXLANs. This can be checked with vtysh:
# show interface vxlan100
Interface vxlan100 is up, line protocol is up
  Link ups:       1    last: 2017/04/29 20:01:33.43
  Link downs:     0    last: (never)
  PTM status: disabled
  vrf: Default-IP-Routing-Table
  index 11 metric 0 mtu 1500
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: 62:42:7a:86:44:01
  inet6 fe80::6042:7aff:fe86:4401/64
  Interface Type Vxlan
  VxLAN Id 100
  Access VLAN Id 1
  Master (bridge) ifindex 9 ifp 0x56536e3f3470
The important points are:
  • the VNI is 100, and
  • the bridge device was correctly detected.
Quagga should also be able to retrieve information about the local MAC addresses:
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 2
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:0a local  eth1.100
50:54:33:00:00:0b local  eth2.100

BGP sessions

Each VTEP has to establish a BGP session to the route reflectors. On the VTEP, this can be checked by running vtysh:
# show bgp neighbors 203.0.113.254
BGP neighbor is 203.0.113.254, remote AS 65000, local AS 65000, internal link
 Member of peer-group fabric for session parameters
  BGP version 4, remote router ID 203.0.113.254
  BGP state = Established, up for 00:00:45
  Neighbor capabilities:
    4 Byte AS: advertised and received
    AddPath:
      L2VPN EVPN: RX advertised L2VPN EVPN
    Route refresh: advertised and received(new)
    Address family L2VPN EVPN: advertised and received
    Hostname Capability: advertised
    Graceful Restart Capabilty: advertised
[...]
 For address family: L2VPN EVPN
  fabric peer-group member
  Update group 1, subgroup 1
  Packet Queue length 0
  Community attribute sent to this neighbor(both)
  8 accepted prefixes

  Connections established 1; dropped 0
  Last reset never
Local host: 203.0.113.2, Local port: 37603
Foreign host: 203.0.113.254, Foreign port: 179
The output includes the following information:
  • the BGP state is Established,
  • the address family L2VPN EVPN is correctly advertised, and
  • 8 routes are received from this route reflector.
The state of the BGP sessions can also be checked from the route reflectors. With GoBGP, use the following command:
# gobgp neighbor 203.0.113.2
BGP neighbor is 203.0.113.2, remote AS 65000, route-reflector-client
  BGP version 4, remote router ID 203.0.113.2
  BGP state = established, up for 00:04:30
  BGP OutQ = 0, Flops = 0
  Hold time is 9, keepalive interval is 3 seconds
  Configured hold time is 90, keepalive interval is 30 seconds
  Neighbor capabilities:
    multiprotocol:
        l2vpn-evpn:     advertised and received
    route-refresh:      advertised and received
    graceful-restart:   received
    4-octet-as: advertised and received
    add-path:   received
    UnknownCapability(73):      received
    cisco-route-refresh:        received
[...]
  Route statistics:
    Advertised:             8
    Received:               5
    Accepted:               5
With JunOS, use the below command:
> show bgp neighbor 203.0.113.2
Peer: 203.0.113.2+38089 AS 65000 Local: 203.0.113.254+179 AS 65000
  Group: fabric                Routing-Instance: master
  Forwarding routing-instance: master
  Type: Internal    State: Established
  Last State: OpenConfirm   Last Event: RecvKeepAlive
  Last Error: None
  Options: <Preference LocalAddress Cluster AddressFamily Rib-group Refresh>
  Address families configured: evpn
  Local Address: 203.0.113.254 Holdtime: 90 Preference: 170
  NLRI evpn: NoInstallForwarding
  Number of flaps: 0
  Peer ID: 203.0.113.2     Local ID: 203.0.113.254     Active Holdtime: 9
  Keepalive Interval: 3          Group index: 0    Peer index: 2
  I/O Session Thread: bgpio-0 State: Enabled
  BFD: disabled, down
  NLRI for restart configured on peer: evpn
  NLRI advertised by peer: evpn
  NLRI for this session: evpn
  Peer supports Refresh capability (2)
  Stale routes from peer are kept for: 300
  Peer does not support Restarter functionality
  NLRI that restart is negotiated for: evpn
  NLRI of received end-of-rib markers: evpn
  NLRI of all end-of-rib markers sent: evpn
  Peer does not support LLGR Restarter or Receiver functionality
  Peer supports 4 byte AS extension (peer-as 65000)
  NLRI's for which peer can receive multiple paths: evpn
  Table bgp.evpn.0 Bit: 20000
    RIB State: BGP restart is complete
    RIB State: VPN restart is complete
    Send state: in sync
    Active prefixes:              5
    Received prefixes:            5
    Accepted prefixes:            5
    Suppressed due to damping:    0
    Advertised prefixes:          8
  Last traffic (seconds): Received 276  Sent 170  Checked 276
  Input messages:  Total 61     Updates 3       Refreshes 0     Octets 1470
  Output messages: Total 62     Updates 4       Refreshes 0     Octets 1775
  Output Queue[1]: 0            (bgp.evpn.0, evpn)
If a BGP session cannot be established, the logs of each BGP daemon should mention the cause.

Sent routes

From each VTEP, Quagga needs to send:
  • one type 3 route for each local VNI, and
  • one type 2 route for each local MAC address.
The best place to check the received routes is on one of the route reflectors. If you are using JunOS, the following command will display the received routes from the provided VTEP:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2
bgp.evpn.0: 10 destinations, 10 routes (10 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
  2:203.0.113.2:100::0::50:54:33:00:00:0a/304 MAC/IP
*                         203.0.113.2                  100        I
  2:203.0.113.2:100::0::50:54:33:00:00:0b/304 MAC/IP
*                         203.0.113.2                  100        I
  3:203.0.113.2:100::0::203.0.113.2/304 IM
*                         203.0.113.2                  100        I
  3:203.0.113.2:200::0::203.0.113.2/304 IM
*                         203.0.113.2                  100        I
There is one type 3 route for VNI 100 and another one for VNI 200. There are also two type 2 routes for two MAC addresses on VNI 100. To get more information, you can add the keyword extensive. Here is a type 3 route advertising 203.0.113.2 as a VTEP for VNI 100 [8]:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2 extensive
bgp.evpn.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
* 3:203.0.113.2:100::0::203.0.113.2/304 IM (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.2:100
     Nexthop: 203.0.113.2
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
[...]
Here is a type 2 route announcing the location of the 50:54:33:00:00:0a MAC address for VNI 100:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.2 extensive
bgp.evpn.0: 11 destinations, 11 routes (11 active, 0 holddown, 0 hidden)
* 2:203.0.113.2:100::0::50:54:33:00:00:0a/304 MAC/IP (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.2:100
     Route Label: 100
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: 203.0.113.2
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
[...]
With Quagga, you can get a similar output with vtysh:
# show bgp evpn route
BGP table version is 0, local router ID is 203.0.113.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 203.0.113.2:100
*>i[2]:[0]:[0]:[48]:[50:54:33:00:00:0a]
                    203.0.113.2                   100      0 i
*>i[2]:[0]:[0]:[48]:[50:54:33:00:00:0b]
                    203.0.113.2                   100      0 i
*>i[3]:[0]:[32]:[203.0.113.2]
                    203.0.113.2                   100      0 i
Route Distinguisher: 203.0.113.2:200
*>i[3]:[0]:[32]:[203.0.113.2]
                    203.0.113.2                   100      0 i
[...]
With GoBGP, use the following command:
# gobgp global rib -a evpn | grep rd:203.0.113.2:200
    Network  Next Hop             AS_PATH              Age        Attrs
*>  [type:macadv][rd:203.0.113.2:100][esi:single-homed][etag:0][mac:50:54:33:00:00:0a][ip:<nil>][labels:[100]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:macadv][rd:203.0.113.2:100][esi:single-homed][etag:0][mac:50:54:33:00:00:0b][ip:<nil>][labels:[100]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:macadv][rd:203.0.113.2:200][esi:single-homed][etag:0][mac:50:54:33:00:00:0a][ip:<nil>][labels:[200]]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435656] ]
*>  [type:multicast][rd:203.0.113.2:100][etag:0][ip:203.0.113.2]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435556] ]
*>  [type:multicast][rd:203.0.113.2:200][etag:0][ip:203.0.113.2]203.0.113.2                               00:00:17   [ Origin: i   LocalPref: 100   Extcomms: [VXLAN], [65000:268435656] ]

Received routes

Each VTEP should have received the type 2 and type 3 routes from its fellow VTEPs, through the route reflectors. You can check with the show bgp evpn route command of vtysh. Does Quagga correctly understand the received routes? The type 3 routes are translated to an association between the remote VTEPs and the VNIs:
# show evpn vni
Number of VNIs: 2
VNI        VxLAN IF              VTEP IP         # MACs   # ARPs   Remote VTEPs
100        vxlan100              203.0.113.2     4        0        203.0.113.3
                                                                   203.0.113.1
200        vxlan200              203.0.113.2     3        0        203.0.113.3
                                                                   203.0.113.1
The type 2 routes are translated to an association between the remote MACs and the remote VTEPs:
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 4
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:09 remote 203.0.113.1
50:54:33:00:00:0a local  eth1.100
50:54:33:00:00:0b local  eth2.100
50:54:33:00:00:0c remote 203.0.113.3

FDB configuration

The last step is to ensure Quagga has correctly provided the received information to the kernel. This can be checked with the bridge command:
# bridge fdb show dev vxlan100 | grep dst
00:00:00:00:00:00 dst 203.0.113.1 self permanent
00:00:00:00:00:00 dst 203.0.113.3 self permanent
50:54:33:00:00:0c dst 203.0.113.3 self
50:54:33:00:00:09 dst 203.0.113.1 self
All good! The first two lines are the translation of the type 3 routes (any BUM frame will be sent to both 203.0.113.1 and 203.0.113.3) and the last two are the translation of the type 2 routes.
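
To make the translation concrete, here is roughly what equivalent entries would look like if programmed by hand with iproute2; this is only an illustration, since with BGP EVPN Quagga installs them for us:
# flood BUM traffic to each remote VTEP (one all-zero entry per VTEP)
bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 203.0.113.1 self permanent
bridge fdb append 00:00:00:00:00:00 dev vxlan100 dst 203.0.113.3 self permanent
# point each remote MAC to the VTEP that announced it
bridge fdb add 50:54:33:00:00:09 dev vxlan100 dst 203.0.113.1 self
bridge fdb add 50:54:33:00:00:0c dev vxlan100 dst 203.0.113.3 self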

Interoperability

One of the strengths of BGP EVPN is interoperability with other network vendors. To demonstrate that it works as expected, we will configure a Juniper vMX to act as a VTEP. First, we need to configure the physical bridge [9]. This is similar to the use of ip link and brctl with Linux. We only configure one physical interface with two old-school VLANs paired with matching VNIs.
interfaces {
    ge-0/0/1 {
        unit 0 {
            family bridge {
                interface-mode trunk;
                vlan-id-list [ 100 200 ];
            }
        }
    }
}
routing-instances {
    switch {
        instance-type virtual-switch;
        interface ge-0/0/1.0;
        bridge-domains {
            vlan100 {
                domain-type bridge;
                vlan-id 100;
                vxlan {
                    vni 100;
                    ingress-node-replication;
                }
            }
            vlan200 {
                domain-type bridge;
                vlan-id 200;
                vxlan {
                    vni 200;
                    ingress-node-replication;
                }
            }
        }
    }
}
Then, we configure BGP EVPN to advertise all known VNIs. The configuration is quite similar to the one we did with Quagga:
protocols {
    bgp {
        group fabric {
            type internal;
            multihop;
            family evpn signaling;
            local-address 203.0.113.3;
            neighbor 203.0.113.253;
            neighbor 203.0.113.254;
        }
    }
}
routing-instances {
    switch {
        vtep-source-interface lo0.0;
        route-distinguisher 203.0.113.3:1;
        vrf-import EVPN-VRF-VXLAN;
        vrf-target {
            target:65000:1;
            auto;
        }
        protocols {
            evpn {
                encapsulation vxlan;
                extended-vni-list all;
                multicast-mode ingress-replication;
            }
        }
    }
}
routing-options {
    router-id 203.0.113.3;
    autonomous-system 65000;
}
policy-options {
    policy-statement EVPN-VRF-VXLAN {
        then accept;
    }
}
We also need a small compatibility patch for Cumulus Quagga [10]. The routes sent by this configuration are very similar to the routes sent by Quagga. The main differences are:
  • on JunOS, the route distinguisher is configured statically (with the route-distinguisher statement under routing-instances), and
  • on JunOS, the VNI is also encoded as an Ethernet tag ID.
Here is a type 3 route, as sent by JunOS:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.3 extensive
bgp.evpn.0: 13 destinations, 13 routes (13 active, 0 holddown, 0 hidden)
* 3:203.0.113.3:1::100::203.0.113.3/304 IM (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.3:1
     Nexthop: 203.0.113.3
     Localpref: 100
     AS path: I
     Communities: target:65000:268435556 encapsulation:vxlan(0x8)
     PMSI: Flags 0x0: Label 6: Type INGRESS-REPLICATION 203.0.113.3
[...]
Here is a type 2 route:
> show route table bgp.evpn.0 receive-protocol bgp 203.0.113.3 extensive
bgp.evpn.0: 13 destinations, 13 routes (13 active, 0 holddown, 0 hidden)
* 2:203.0.113.3:1::200::50:54:33:00:00:0f/304 MAC/IP (1 entry, 1 announced)
     Accepted
     Route Distinguisher: 203.0.113.3:1
     Route Label: 200
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: 203.0.113.3
     Localpref: 100
     AS path: I
     Communities: target:65000:268435656 encapsulation:vxlan(0x8)
[...]
We can check that the vMX is able to make sense of the routes it receives from its peers running Quagga:
> show evpn database l2-domain-id 100
Instance: switch
VLAN  DomainId  MAC address        Active source                  Timestamp        IP address
     100        50:54:33:00:00:0c  203.0.113.1                    Apr 30 12:46:20
     100        50:54:33:00:00:0d  203.0.113.2                    Apr 30 12:32:42
     100        50:54:33:00:00:0e  203.0.113.2                    Apr 30 12:46:20
     100        50:54:33:00:00:0f  ge-0/0/1.0                     Apr 30 12:45:55
On the other end, if we look at one of the Quagga-based VTEPs, we can check that the received routes are correctly understood:
# show evpn vni 100
VNI: 100
 VxLAN interface: vxlan100 ifIndex: 9 VTEP IP: 203.0.113.1
 Remote VTEPs for this VNI:
  203.0.113.3
  203.0.113.2
 Number of MACs (local and remote) known for this VNI: 4
 Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 0
# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 4
MAC               Type   Intf/Remote VTEP      VLAN
50:54:33:00:00:0c local  eth1.100
50:54:33:00:00:0d remote 203.0.113.2
50:54:33:00:00:0e remote 203.0.113.2
50:54:33:00:00:0f remote 203.0.113.3
Get in touch if you have some success with other vendors!

  1. For example, they may use bridges to connect containers together.
  2. Such a feature can replace proprietary implementations of MC-LAG, allowing several VTEPs to act as an endpoint for a single link aggregation group. This is not needed in our scenario, where hypervisors act as VTEPs.
  3. The development of Quagga is slow and "closed". New features are often stalled. FRR is placed under the umbrella of the Linux Foundation, has a GitHub-centered development model and an election process. It already has several interesting enhancements (notably BGP add-path, BGP unnumbered, MPLS and LDP).
  4. I am unenthusiastic about projects whose sole purpose is to rewrite something in Go. However, while still quite young, GoBGP is quite valuable on its own (good architecture, good performance).
  5. The 48-port version is around $10,000 with the BGP license.
  6. An empty chassis with a dual routing engine (RE-S-1800X4-16G) is around $30,000.
  7. I don't know how pricey the vRR is. For evaluation purposes, it can be downloaded for free if you are a customer.
  8. The value 100 used in the route distinguisher (203.0.113.2:100) is not the one used to encode the VNI. The VNI is encoded in the route target (65000:268435556), in the 24 least significant bits (268435556 & 0xffffff equals 100; a one-line check is shown after these notes). As long as VNIs are unique, we don't have to understand those details.
  9. For some reason, the use of a virtual switch is mandatory. This is specific to this platform: a QFX doesn't require it.
  10. The encoding of the VNI into the route target is being standardized in draft-ietf-bess-evpn-overlay. Juniper already implements this draft.
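
As mentioned in note 8, the VNI sits in the 24 least significant bits of the route target value; a one-line shell check confirms the arithmetic:
$ printf '%d\n' $(( 268435556 & 0xffffff ))
100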

11 April 2017

Riku Voipio: Deploying OBS

Open Build Service from SUSE is a web service for building deb/rpm packages. It has recently been added to Debian, so there is finally a relatively easy way to set up PPA-style repositories in Debian. Relative as in "there is a learning curve, but nowhere near the complexity of replicating Debian's internal infrastructure". OBS will give you both repositories and build infrastructure, with a clickety web UI and a command-line client (osc) to manage them. See Hector's blog for quickstart instructions.

Things learned while setting up OBS

With me coming from a Debian background and OBS coming from the SUSE/RPM world, there are some quirks that can take you by surprise.

Well done packaging. Usually web services are a tough fit for distros: a cascade of weird dependencies and build systems where the only practical way to build an "open source" web service is by replicating the upstream CI scripts. Not in the case of OBS. Being done by distro people shows.

OBS does automatic rebuilds of reverse dependencies. Aka automatic binNMUs when you update a library. This however means you need lots of build power around. OBS has its own dependency resolver on the server that recalculates which packages need rebuilding and when; workers just get a list of packages to install for build-depends. This is a major divergence from Debian, where sbuild handles dependencies client-side. The OBS dependency handler doesn't handle virtual packages* / alternative build-deps like Debian does - you may have to add a specific "Prefer: foo-dev" to the OBS project config to resolve alternative choices.

OBS server and worker do HTTP requests in both directions. On startup, workers connect to the OBS server, open a TCP port and wait for requests coming from OBS. Having connections in both directions is a bit of a hassle firewall-wise. On the bright side, no need to set up uploads via FTP here.

Signing repositories is complicated. With Debian 9.0 making signed repositories pretty much mandatory, OBS makes signing rather complicated. obs-signd isn't included in Debian, since it depends on a gnupg patch that hasn't been upstreamed. Fortunately I found a workaround: OBS signs release files with /usr/bin/sign -d /path/to/release, and replacing the obs-signd-provided sign command with your own script is easy ;) (a rough sketch of such a wrapper is shown at the end of this post).

Git integration is rather bolted-on than integrated. OBS provides a method to integrate with git using services. There is no clickety UI to link to a git repo; instead you create an XML file called _service with osc. There is no way to have the debian/ tree in git.

The upstream community is friendly. Including the happiest thanks from an upstream I've seen recently.

Summary. All in all I'm rather satisfied with OBS. If you have a home-grown Jenkins or similar solution for building deb/rpm packages, you should definitely consider OBS. For simpler uses, there is no need to install OBS yourself; the public openSUSE OBS will happily build Debian packages for you.

*How useful are virtual packages anymore? "foo-defaults" packages seem to be the go-to solution for most real use cases anyway.
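
As a rough sketch only (the exact calling convention expected by OBS from its sign helper is an assumption here, not something verified against obs-signd), a replacement /usr/bin/sign wrapper could simply hand the work to gpg:
#!/bin/sh
# Hypothetical stand-in for the obs-signd "sign" helper.
# Assumption: OBS invokes it as "sign -d /path/to/release" and expects a
# detached, ASCII-armoured signature on stdout. Adjust key selection and
# output handling to match your OBS configuration.
set -e
case "$1" in
    -d)
        shift
        exec gpg --batch --yes --armor --detach-sign --output - "$1"
        ;;
    *)
        echo "unsupported mode: $1" >&2
        exit 1
        ;;
esac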

8 March 2017

Antoine Beaupré: An update to GitHub's terms of service

On February 28th, GitHub published a brand new version of its Terms of Service (ToS). While the first draft announced earlier in February didn't generate much reaction, the new ToS raised concerns that they may break at least the spirit, if not the letter, of certain free-software licenses. Digging in further reveals that the situation is probably not as dire as some had feared. The first person to raise the alarm was probably Thorsten Glaser, a Debian developer, who stated that the "new GitHub Terms of Service require removing many Open Source works from it". His concerns are mainly about section D of the document, in particular section D.4 which states:
You grant us and our legal successors the right to store and display your Content and make incidental copies as necessary to render the Website and provide the Service.
Section D.5 then goes on to say:
[...] You grant each User of GitHub a nonexclusive, worldwide license to access your Content through the GitHub Service, and to use, display and perform your Content, and to reproduce your Content solely on GitHub as permitted through GitHub's functionality

ToS versus GPL

The concern here is that the ToS bypass the normal provisions of licenses like the GPL. Indeed, copyleft licenses are based on copyright law, which forbids users from doing anything with the content unless they comply with the license, which forces, among other things, "share alike" properties. By granting GitHub and its users rights to reproduce content without explicitly respecting the original license, the ToS may allow users to bypass the copyleft nature of the license. Indeed, as Joey Hess, author of git-annex, explained:
The new TOS is potentially very bad for copylefted Free Software. It potentially neuters it entirely, so GPL licensed software hosted on Github has an implicit BSD-like license
Hess has since removed all his content (mostly mirrors) from GitHub. Others disagree. In a well-reasoned blog post, Debian developer Jonathan McDowell explained the rationale behind the changes:
My reading of the GitHub changes is that they are driven by a desire to ensure that GitHub are legally covered for the things they need to do with your code in order to run their service.
This seems like a fair point to make: GitHub needs to protect its own rights to operate the service. McDowell then goes on to do a detailed rebuttal of the arguments made by Glaser, arguing specifically that section D.5 "does not grant [...] additional rights to reproduce outside of GitHub". However, specific problems arise when we consider that GitHub is a private corporation that users have no control over. The "Services" defined in the ToS explicitly "refers to the applications, software, products, and services provided by GitHub". The term "Services" is therefore not limited to the current set of services. This loophole may actually give GitHub the right to bypass certain provisions of licenses used on GitHub. As Hess detailed in a later blog post:
If Github tomorrow starts providing say, an App Store service, that necessarily involves distribution of software to others, and they put my software in it, would that be allowed by this or not? If that hypothetical Github App Store doesn't sell apps, but licenses access to them for money, would that be allowed under this license that they want to my software?
However, when asked on IRC, Bradley M. Kuhn of the Software Freedom Conservancy explained that "ultimately, failure to comply with a copyleft license is a copyright infringement" and that the ToS do outline a process to deal with such infringement. Some lawyers have also publicly expressed their disagreement with Glaser's assessment, with Richard Fontana from Red Hat saying that the analysis is "basically wrong". It all comes down to the intent of the ToS, as Kuhn (who is not a lawyer) explained:
any license can be abused or misused for an intent other than its original intent. It's why it matters to get every little detail right, and I hope Github will do that.
He went even further and said that "we should assume the ambiguity in their ToS as it stands is favorable to Free Software". The ToS have been in effect since February 28th; users "can accept them by clicking the broadcast announcement on your dashboard or by continuing to use GitHub". The immediacy of the change is one of the reasons why certain people are rushing to remove content from GitHub: there are concerns that continuing to use the service may be interpreted as consent to bypass those licenses. Hess even hosted a separate copy of the ToS [PDF] for people to be able to read the document without implicitly consenting. It is, however, unclear how a user should remove their content from the GitHub servers without actually agreeing to the new ToS.

CLAs

When I read the first draft, I initially thought there would be concerns about the mandatory Contributor License Agreement (CLA) in section D.5 of the draft:
[...] unless there is a Contributor License Agreement to the contrary, whenever you make a contribution to a repository containing notice of a license, you license your contribution under the same terms, and agree that you have the right to license your contribution under those terms.
I was concerned this would establish the controversial practice of forcing CLAs on every GitHub user. I managed to find a post from a lawyer, Kyle E. Mitchell, who commented on the draft and, specifically, on the CLA. He outlined issues with wording and definition problems in that section of the draft. In particular, he noted that "contributor license agreement is not a legal term of art, but an industry term" and "is a bit fuzzy". This was clarified in the final draft, in section D.6, by removing the use of the CLA term and by explicitly mentioning the widely accepted norm for licenses: "inbound=outbound". So it seems that section D.6 is not really a problem: contributors do not need to necessarily delegate copyright ownership (as some CLAs require) when they make a contribution, unless otherwise noted by a repository-specific CLA. An interesting concern he raised, however, was with how GitHub conducted the drafting process. A blog post announced the change on February 7th with a link to a form to provide feedback until the 21st, with a publishing deadline of February 28th. This gave little time for lawyers and developers to review the document and comment on it. Users then had to basically accept whatever came out of the process as-is. Unlike every software project hosted on GitHub, the ToS document is not part of a Git repository people can propose changes to or even collaboratively discuss. While Mitchell acknowledges that "GitHub are within their rights to update their terms, within very broad limits, more or less however they like, whenever they like", he sets higher standards for GitHub than for other corporations, considering the community it serves and the spirit it represents. He described the process as:
[...] consistent with the value of CYA, which is real, but not with the output-improving virtues of open process, which is also real, and a great deal more pleasant.
Mitchell also explained that, because of its position, GitHub can have a major impact on the free-software world.
And as the current forum of preference for a great many developers, the knock-on effects of their decisions throw big weight. While GitHub have the wheel, and they've certainly earned it for now, they can do real damage.
In particular, there have been some concerns that the ToS change may be an attempt to further the already diminishing adoption of the GPL for free-software projects; on GitHub, the GPL has been surpassed by the MIT license. But Kuhn believes that attitudes at GitHub have begun changing:
GitHub historically had an anti-copyleft culture, which was created in large part by their former and now ousted CEO, Preston-Warner. However, recently, I've seen people at GitHub truly reach out to me and others in the copyleft community to learn more and open their minds. I thus have a hard time believing that there was some anti-copyleft conspiracy in this ToS change.

GitHub response

However, it seems that GitHub has actually been proactive in reaching out to the free software community. Kuhn noted that GitHub contacted the Conservancy to get its advice on the ToS changes. While he still thinks GitHub should fix the ambiguities quickly, he also noted that those issues "impact pretty much any non-trivial Open Source and Free Software license", not just copylefted material. When reached for comments, a GitHub spokesperson said:
While we are confident that these Terms serve the best needs of the community, we take our users' feedback very seriously and we are looking closely at ways to address their concerns.
Regardless, free-software enthusiasts have other concerns than the new ToS if they wish to use GitHub. First and foremost, most of the software running GitHub is proprietary, including the JavaScript served to your web browser. GitHub also created a centralized service out of a decentralized tool (Git). It has become the largest code hosting service in the world after only a few years and may well have become a single point of failure for free software collaboration in a way we have never seen before. Outages and policy changes at GitHub can have a major impact on not only the free-software world, but also the larger computing world that relies on its services for daily operation. There are now free-software alternatives to GitHub. GitLab.com, for example, does not seem to have similar licensing issues in its ToS and GitLab itself is free software, although based on the controversial open core business model. The GitLab hosting service still needs to get better than its grade of "C" in the GNU Ethical Repository Criteria Evaluations (and it is being worked on); other services like GitHub and SourceForge score an "F". In the end, all this controversy might have been avoided if GitHub was generally more open about the ToS development process and gave more time for feedback and reviews by the community. Terms of service are notorious for being confusing and something of a legal gray area, especially for end users who generally click through without reading them. We should probably applaud the efforts made by GitHub to make its own ToS document more readable and hope that, with time, it will address the community's concerns.
Note: this article first appeared in the Linux Weekly News.

20 February 2017

Petter Reinholdtsen: Detect OOXML files with undefined behaviour?

I just noticed that the new Norwegian proposal for archiving rules in government lists ECMA-376 / ISO/IEC 29500 (aka OOXML) as valid formats to put in long term storage. Luckily such files will only be accepted based on pre-approval from the National Archive. Allowing OOXML files to be used for long term storage might seem like a good idea as long as we forget that there are plenty of ways for a "valid" OOXML document to have content with no defined interpretation in the standard, which led to a question and an idea. Is there any tool to detect whether an OOXML document depends on such undefined behaviour? It would be useful for the National Archive (and anyone else interested in verifying that a document is well defined) to have such a tool available when considering whether to approve the use of OOXML. I'm aware of the officeotron OOXML validator, but do not know how complete it is nor whether it will report use of undefined behaviour. Are there other similar tools available? Please send me an email if you know of any such tool.

19 February 2017

Gregor Herrmann: RC bugs 2016/52-2017/07

debian is in deep freeze for the upcoming stretch release. still, I haven't dived into fixing "general" release-critical bugs yet; so far I have mostly kept to working on bugs in the debian perl group. thanks to the release team for pro-actively unblocking the packages with fixes which were uploaded after the beginning of the freeze!

5 February 2017

Vincent Bernat: A Makefile for your Go project

My most loathed feature of Go is the mandatory use of GOPATH: I do not want to put my own code next to its dependencies. Fortunately, this issue is slowly starting to be accepted by the main authors. In the meantime, you can work around this problem with more opinionated tools (like gb) or by crafting your own Makefile. For the latter, you can have a look at Filippo Valsorda's example or my own take, which I describe in more detail here. This is not meant to be a universal Makefile but a relatively short one with some batteries included. It comes with a simple "Hello World!" application.

Project structure

For a standalone project, vendoring is a must-have [1] as you cannot rely on your dependencies to not introduce backward-incompatible changes. Some packages are using versioned URLs but most of them aren't. There is currently no standard tool to handle vendoring. My personal take is to vendor all dependencies with Glide. It is a good practice to split an application into different packages while the main one stays fairly small. In the hellogopher example, the CLI is handled in the cmd package while the application logic for printing greetings is in the hello package:
.
  cmd/
    hello.go
    root.go
    version.go
  glide.lock (generated)
  glide.yaml
  vendor/ (dependencies will go there)
  hello/
    root.go
    root_test.go
  main.go
  Makefile
  README.md

Down the rabbit hole

Let's take a look at the various features of the Makefile.

GOPATH handling

Since all dependencies are vendored, only our own project needs to be in the GOPATH:
PACKAGE  = hellogopher
GOPATH   = $(CURDIR)/.gopath
BASE     = $(GOPATH)/src/$(PACKAGE)
$(BASE):
    @mkdir -p $(dir $@)
    @ln -sf $(CURDIR) $@
The base import path is hellogopher, not github.com/vincentbernat/hellogopher: this shortens imports and makes them easily distinguishable from imports of dependency packages. However, your application won't be go get-able. This is a personal choice and can be adjusted with the $(PACKAGE) variable. We just create a symlink from .gopath/src/hellogopher to our root directory. The GOPATH environment variable is automatically exported to the shell commands of the recipes. Any tool should work fine after changing the current directory to $(BASE). For example, this snippet builds the executable:
.PHONY: all
all: | $(BASE)
    cd $(BASE) && $(GO) build -o bin/$(PACKAGE) main.go
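
With these rules in place, building the project and invoking the resulting binary looks like this (what the binary prints without arguments depends on the cmd package):
$ make
$ ./bin/hellogopher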

Vendoring dependencies

Glide is a bit like Ruby's Bundler. In glide.yaml, you specify what packages you need and the constraints you want on them. Glide computes a glide.lock file containing the exact versions for each dependency (including recursive dependencies) and downloads them into the vendor/ folder. I choose to check both the glide.yaml and glide.lock files into the VCS. It's also possible to only check in the first one or to also check in the vendor/ directory. A work in progress is currently ongoing to provide a standard dependency management tool with a similar workflow. We define two rules [2]:
GLIDE = glide
glide.lock: glide.yaml | $(BASE)
    cd $(BASE) && $(GLIDE) update
    @touch $@
vendor: glide.lock | $(BASE)
    cd $(BASE) && $(GLIDE) --quiet install
    @ln -sf . vendor/src
    @touch $@
We use a variable to invoke glide. This enables a user to easily override it (for example, with make GLIDE=$GOPATH/bin/glide).
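
In practice, the dependency workflow then boils down to a couple of make invocations; the second form shows the override mentioned above:
$ make vendor
$ make vendor GLIDE=$GOPATH/bin/glide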

Using third-party tools

Most projects need some third-party tools. We can either expect them to be already installed or compile them in our private GOPATH. For example, here is the lint rule:
BIN    = $(GOPATH)/bin
GOLINT = $(BIN)/golint
$(BIN)/golint: | $(BASE)   # (1)
    go get github.com/golang/lint/golint
.PHONY: lint
lint: vendor | $(BASE) $(GOLINT)   # (2)
    @cd $(BASE) && ret=0 && for pkg in $(PKGS); do \
        test -z "$$($(GOLINT) $$pkg | tee /dev/stderr)" || ret=1 ; \
     done ; exit $$ret
As with Glide, we give the user a chance to override which golint executable to use. By default, it uses a private copy, but a user can use their own copy with make GOLINT=/usr/bin/golint. In (1), we have the recipe to build the private copy. We simply issue go get [3] to download and build golint. In (2), the lint rule executes golint on each package contained in the $(PKGS) variable. We'll explain this variable in the next section.

Working with non-vendored packages only

Some commands need to be provided with a list of packages. Because we use a vendor/ directory, the shortcut ./... is not what we expect as we don't want to run tests on our dependencies [4]. Therefore, we compose a list of packages we care about:
PKGS = $(or $(PKG), $(shell cd $(BASE) && \
    env GOPATH=$(GOPATH) $(GO) list ./... | grep -v "^$(PACKAGE)/vendor/"))
If the user has provided the $(PKG) variable, we use it. For example, if they want to lint only the cmd package, they can invoke make lint PKG=hellogopher/cmd which is more intuitive than specifying PKGS. Otherwise, we just execute go list ./... but we remove anything from the vendor directory.

Tests

Here are some rules to run tests:
TIMEOUT = 20
TEST_TARGETS := test-default test-bench test-short test-verbose test-race
.PHONY: $(TEST_TARGETS) check test tests
test-bench:   ARGS=-run=__absolutelynothing__ -bench=.
test-short:   ARGS=-short
test-verbose: ARGS=-v
test-race:    ARGS=-race
$(TEST_TARGETS): test
check test tests: fmt lint vendor | $(BASE)
    @cd $(BASE) && $(GO) test -timeout $(TIMEOUT)s $(ARGS) $(PKGS)
A user can invoke tests in different ways:
  • make test runs all tests;
  • make test TIMEOUT=10 runs all tests with a timeout of 10 seconds;
  • make test PKG=hellogopher/cmd only runs tests for the cmd package;
  • make test ARGS="-v -short" runs tests with the specified arguments;
  • make test-race runs tests with race detector enabled.

Tests coverage

go test includes a test coverage tool. Unfortunately, it only handles one package at a time and you have to explicitly list the packages to be instrumented, otherwise the instrumentation is limited to the currently tested package. If you provide too many packages, the compilation time will skyrocket. Moreover, if you want an output compatible with Jenkins, you'll need some additional tools.
COVERAGE_MODE    = atomic
COVERAGE_PROFILE = $(COVERAGE_DIR)/profile.out
COVERAGE_XML     = $(COVERAGE_DIR)/coverage.xml
COVERAGE_HTML    = $(COVERAGE_DIR)/index.html
.PHONY: test-coverage test-coverage-tools
test-coverage-tools: | $(GOCOVMERGE) $(GOCOV) $(GOCOVXML) # ❶
test-coverage: COVERAGE_DIR := $(CURDIR)/test/coverage.$(shell date -Iseconds)
test-coverage: fmt lint vendor test-coverage-tools | $(BASE) # ❷
    @mkdir -p $(COVERAGE_DIR)/coverage
    @cd $(BASE) && for pkg in $(PKGS); do \
        $(GO) test \
            -coverpkg=$$($(GO) list -f '{{ join .Deps "\n" }}' $$pkg | \
                    grep '^$(PACKAGE)/' | grep -v '^$(PACKAGE)/vendor/' | \
                    tr '\n' ',')$$pkg \
            -covermode=$(COVERAGE_MODE) \
            -coverprofile="$(COVERAGE_DIR)/coverage/`echo $$pkg | tr "/" "-"`.cover" $$pkg ;\
     done
    @$(GOCOVMERGE) $(COVERAGE_DIR)/coverage/*.cover > $(COVERAGE_PROFILE)
    @$(GO) tool cover -html=$(COVERAGE_PROFILE) -o $(COVERAGE_HTML)
    @$(GOCOV) convert $(COVERAGE_PROFILE) | $(GOCOVXML) > $(COVERAGE_XML)
First, we define some variables to let the user override them. We also require the following tools (in ❶):
  • gocovmerge merges profiles from different runs into a single one;
  • gocov-xml converts a coverage profile to the Cobertura format;
  • gocov is needed to convert a coverage profile to a format handled by gocov-xml.
The rules to build those tools are similar to the rule for golint described a few sections ago. In ❷, for each package to test, we run go test with the -coverprofile argument. We also explicitly provide the list of packages to instrument to -coverpkg by using go list to get the list of dependencies for the tested package and keeping only our own.

Final result While the main goal of using a Makefile was to work around GOPATH, it's also a good place to hide the complexity of some operations, notably around test coverage. The excerpts provided in this post are a bit simplified. Have a look at the final result for more perks!

  1. In Go, vendoring is about both bundling and dependency management. As the Go ecosystem matures, the bundling part (fixed snapshots of dependencies) may become optional but the vendor/ directory may stay for dependency management (retrieval of the latest versions of dependencies matching a set of constraints).
  2. If you don t want to automatically update glide.lock when a change is detected in glide.yaml, rename the target to deps-update and make it a phony target.
  3. There is some irony in bad-mouthing go get and then immediately using it, but it is convenient.
  4. I think ./... should not include the vendor/ directory by default. Dependencies should be trusted to have run their own tests in the environments where they expect them to succeed. Unfortunately, this is unlikely to change.

2 February 2017

Paul Wise: FLOSS Activities January 2017

Changes

Issues

Review

Administration
  • Debian: reboot 1 non-responsive VM, redirect 2 users to support channels, redirect 1 contributor to xkb upstream, redirect 1 potential contributor, redirect 1 bug reporter to mirror team, ping 7 folks about restarting processes with upgraded libs, manually restart the sectracker process due to upgraded libs, restart the package tracker process due to upgraded libs, investigate failures connecting to the XMPP service, investigate /dev/shm issue on abel.d.o, clean up after rename of the fedmsg group.
  • Debian mentors: lintian/security updates & reboot
  • Debian packages: deploy 2 contributions to the live server
  • Debian wiki: unblacklist 1 IP address, whitelist 10 email addresses, disable 18 accounts with bouncing email, update email for 2 accounts with bouncing email, reported 1 Debian member as MIA, redirect 1 user to support channels, add 4 domains to the whitelist.
  • Reproducible builds: rescheduled Debian pyxplot:amd64/unstable for themill.
  • Openmoko: security updates & reboots.

Debian derivatives
  • Send the annual activity ping mail.
  • Happy new year messages on IRC, forward to the list.
  • Note that SerbianLinux does not provide source packages.
  • Expand URL shortener on SerbianLinux page.
  • Invite PelicanHPC, Netrunner, DietPi, Hamara Linux (on IRC), BitKey to the census.
  • Add research publications link to the census template
  • Fix Symbiosis sources.list
  • Enquired about SalentOS downtime
  • Fixed and removed some 404 BlankOn links (blog, English homepage)
  • Fixed changes to AstraLinux sources.list
  • Welcome Netrunner to the census

Sponsors I renewed my support of Software Freedom Conservancy. The openchange 1:2.2-6+deb8u1 upload was sponsored by my employer. All other work was done on a volunteer basis.

15 January 2017

Mehdi Dogguy: Debian from 10,000 feet

Many of you are big fans of S.W.O.T analysis, I am sure of that! :-) Technical competence is our strongest suit, but we have reached a size and sphere of influence which requires an increase in organisation.

We all love our project and want to make sure Debian still shines in the next decades (and centuries!). One way to secure that goal is to identify elements/events/things which could put that goal at risk. To this end, we've organized a short S.W.O.T analysis session at DebConf16. Minutes of the meeting can be found here. I believe it is an interesting read and is useful for Debian old-timers as well as newcomers. It helps to convey a better understanding of the project's status. For each item, we've tried to identify an action.

Here are a few things we've worked on:
During the next DebConf, we can review the progress that has been made on each item and discuss new ones. In addition to this session acting as a health check, I see it as a way for the DPL to discuss, openly and publicly, the important changes that should be implemented in the project and to imagine together a better future.

In the meantime, everyone should feel free to pick one item from the list and work on it. :-)

12 December 2016

Kees Cook: security things in Linux v4.9

Previously: v4.8. Here are a bunch of security things I'm excited about in the newly released Linux v4.9: Latent Entropy GCC plugin Building on her earlier work to bring GCC plugin support to the Linux kernel, Emese Revfy ported PaX's Latent Entropy GCC plugin to upstream. This plugin is significantly more complex than the others that have already been ported, and performs extensive instrumentation of functions marked with __latent_entropy. These functions have their branches and loops adjusted to mix random values (selected at build time) into a global entropy gathering variable. Since the branch and loop ordering is very specific to boot conditions, CPU quirks, memory layout, etc., this provides some additional uncertainty to the kernel's entropy pool. Since the entropy actually gathered is hard to measure, no entropy is "credited", but rather used to mix the existing pool further. Probably the best place to enable this plugin is on small devices without other strong sources of entropy. vmapped kernel stack and thread_info relocation on x86 Normally, kernel stacks are mapped together in memory. This meant that attackers could use forms of stack exhaustion (or stack buffer overflows) to reach past the end of a stack and start writing over another process's stack. This is bad, and one way to stop it is to provide guard pages between stacks, which is provided by vmalloced memory. Andy Lutomirski did a bunch of work to move to vmapped kernel stack via CONFIG_VMAP_STACK on x86_64. Now when writing past the end of the stack, the kernel will immediately fault instead of just continuing to blindly write. Related to this, the kernel was storing thread_info (which contained sensitive values like addr_limit) at the bottom of the kernel stack, which was an easy target for attackers to hit. Between a combination of explicitly moving targets out of thread_info, removing needless fields, and entirely moving thread_info off the stack, Andy Lutomirski and Linus Torvalds created CONFIG_THREAD_INFO_IN_TASK for x86. CONFIG_DEBUG_RODATA mandatory on arm64 As recently done for x86, Mark Rutland made CONFIG_DEBUG_RODATA mandatory on arm64. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so there's no reason to make the protection optional. random_page() cleanup Cleaning up the code around the userspace ASLR implementations makes them easier to reason about. This has been happening for things like the recent consolidation on arch_mmap_rnd() for ET_DYN and during the addition of the entropy sysctl. Both uncovered some awkward uses of get_random_int() (or similar) in and around arch_mmap_rnd() (which is used for mmap (and therefore shared library) and PIE ASLR), as well as in randomize_stack_top() (which is used for stack ASLR). Jason Cooper cleaned things up further by doing away with randomize_range() entirely and replacing it with the saner random_page(), making the per-architecture arch_randomize_brk() (responsible for brk ASLR) much easier to understand. That's it for now! Let me know if there are other fun things to call attention to in v4.9.
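To make the __latent_entropy marking described above a little more concrete, here is a hedged sketch. The struct and function are invented for illustration; only the __latent_entropy annotation itself is the kernel facility being discussed (it expands to a GCC attribute when the plugin is enabled and to nothing otherwise).

/* Illustrative only: "struct frob" and do_frobnicate() are made up; the
 * point is the __latent_entropy marker, which asks the GCC plugin to
 * instrument this function's branches and loops so that they mix
 * build-time random values into the global entropy-gathering variable. */
#include <linux/compiler.h>   /* __latent_entropy (a no-op without the plugin) */
#include <linux/list.h>
#include <linux/types.h>

struct frob {
    struct list_head node;
    bool pending;
    unsigned long count;
};

static __latent_entropy void do_frobnicate(struct list_head *items)
{
    struct frob *f;

    list_for_each_entry(f, items, node) {   /* loops get instrumented */
        if (f->pending)                     /* ...and so do branches  */
            f->count++;
    }
}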

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

4 December 2016

Ben Hutchings: Linux Kernel Summit 2016, part 2

I attended this year's Linux Kernel Summit in Santa Fe, NM, USA and made notes on some of the sessions that were relevant to Debian. LWN also reported many of the discussions. This is the second and last part of my notes; part 1 is here. Kernel Hardening Kees Cook presented the ongoing work on upstream kernel hardening, also known as the Kernel Self-Protection Project or KSPP. GCC plugins The kernel build system can now build and use GCC plugins to implement some protections. This requires gcc 4.5 and the plugin headers installed. It has been tested on x86, arm, and arm64. It is disabled by CONFIG_COMPILE_TEST because CI systems using allmodconfig/allyesconfig probably don't have those installed, but this ought to be changed at some point. There was a question as to how plugin headers should be installed for cross-compilers or custom compilers, but I didn't hear a clear answer to this. Kees has been prodding distribution gcc maintainers to package them. Mark Brown mentioned the Linaro toolchain being widely used; Kees has not talked to its maintainers yet. Probabilistic protections These protections are based on hidden state that an attacker will need to discover in order to make an effective attack; they reduce the probability of success but don't prevent it entirely. Kernel address space layout randomisation (KASLR) has now been implemented on x86, arm64, and mips for the kernel image. (Debian enables this.) However there are still lots of information leaks that defeat this. This could theoretically be improved by relocating different sections or smaller parts of the kernel independently, but this requires re-linking at boot. Aside from software information leaks, the branch target predictor on (common implementations of) x86 provides a side channel to find addresses of branches in the kernel. Page and heap allocation, etc., is still quite predictable. struct randomisation (RANDSTRUCT plugin from grsecurity) reorders members in (a) structures containing only function pointers (b) explicitly marked structures. This makes it very hard to attack custom kernels where the kernel image is not readable. But even for distribution kernels, it increases the maintenance burden for attackers. Deterministic protections These protections block a class of attacks completely. Read-only protection of kernel memory is either mandatory or enabled by default on x86, arm, and arm64. (Debian enables this.) Protections against execution of user memory in kernel mode are now implemented in hardware on x86 (SMEP, in Intel processors from Skylake onward) and on arm64 (PXN, from ARMv8.1). But Skylake is not available for servers and ARMv8.1 is not yet implemented at all! s390 always had this protection. It may be possible to 'emulate' this using other hardware protections. arm (v7) and arm64 now have this, but x86 doesn't. Linus doesn't like the overhead of previously proposed implementations for x86. It is possible to do this using PCID (in Intel processors from Sandy Bridge onward), which has already been done in PaX - and this should be fast enough. Virtually mapped stacks protect against stack overflow attacks. They were implemented as an option for x86 only in 4.9. (Debian enables this.) Copies to or from user memory sometimes use a user-controlled size that is not properly bounded. Hardened usercopy, implemented as an option in 4.8 for many architectures, protects against this. (Debian enables this.) Memory wiping (zero on free) protects against some information leaks and use-after-free bugs. 
It was already implemented as a debug feature with a non-zero poison value, but at some performance cost. Zeroing can be cheaper since it allows the allocator to skip zeroing on reallocation. That was implemented as an option in 4.6. (Debian does not currently enable this but we might do if the performance cost is low enough.) Constification (with the CONSTIFY gcc plugin) reduces the amount of static data that can be written to. As with RANDSTRUCT, this is applied to function pointer tables and explicitly marked structures. Instances of some types need to be modified very occasionally. In PaX/Grsecurity this is done with pax_{open,close}_kernel() which globally disable write protection temporarily. It would be preferable to override write protection in a more directed way, so that the permission to write doesn't leak into any other code that interrupts this process. The feature is not in mainline yet. Atomic wrap detection protects against reference-counting bugs which can result in a use-after-free (a short sketch of this bug pattern follows after these notes). Overflow and underflow are trapped and result in an 'oops'. There is no measurable performance impact. It would be applied to all operations on the atomic_t type, but there needs to be an opt-out for atomics that are not ref-counters - probably by adding an atomic_wrap_t type for them. This has been implemented for x86, arm, and arm64 but is not in mainline yet. Kernel Freezer Hell For the second year running, Jiri Kosina raised the problem of 'freezing' kthreads (kernel-mode threads) in preparation for system suspend (suspend to RAM, or hibernation). What are the semantics? What invariants should be met when a kthread gets frozen? They are not defined anywhere. Most freezable threads don't actually need to be quiesced. Also many non-freezable threads are pointlessly calling try_to_freeze() (probably due to copying code without understanding it). At a system level, what we actually need is I/O and filesystem consistency. This should be achieved by: The system suspend code should not need to directly freeze threads. Kernel Documentation Jon Corbet and Mauro Carvalho presented the recent work on kernel documentation. The kernel's documentation system was a house of cards involving DocBook and a lot of custom scripting. Both the DocBook templates and plain text files are gradually being converted to reStructuredText format, processed by Sphinx. However, manual page generation is currently 'broken' for documents processed by Sphinx. There are about 150 files at the top level of the documentation tree, that are being gradually moved into subdirectories. The most popular files, that are likely to be referenced in external documentation, have been replaced by placeholders. Sphinx is highly extensible and this has been used to integrate kernel-doc. It would be possible to add extensions that parse and include the MAINTAINERS file and Documentation/ABI/ files, which have their own formats, but the documentation maintainers would prefer not to add extensions that can't be pushed to Sphinx upstream. There is lots of obsolete documentation, and patches to remove those would be welcome. Linus objected to PDF files recently added under the Documentation/media directory - they are not the source format so should not be there! They should be generated from the corresponding SVG or image files at build time. Issues around Tracepoints Steve Rostedt and Shuah Khan led a discussion about tracepoints. Currently each maintainer decides which tracepoints to create.
The cost of each added tracepoint is minimal, but the cost of very many tracepoints is more substantial. So there is such a thing as too many tracepoints, and we need a policy to decide when they are justified. They advised not to create tracepoints just in case, since kprobes can be used for tracing (almost) anywhere dynamically. There was some support for requiring documentation of each new tracepoint. That may dissuade introduction of obscure tracepoints, but also creates a higher expectation of stability. Tools such as bcc and IOVisor are now being created that depend on specific tracepoints or even function names (through kprobes). Should we care about breaking them? Linus said that we should strive to be polite to developers and users relying on tracepoints, but if it's too painful to maintain a tracepoint then we should go ahead and change it. Where the end users of the tool are themselves developers it's more reasonable to expect them to upgrade the tool and we should care less about changing it. In some cases tracepoints could provide dummy data for compatibility (as is done in some places in procfs).
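To illustrate the reference-counting bug class that the atomic wrap detection above targets, here is a hedged sketch. The session object and helper functions are invented; atomic_t, atomic_inc(), atomic_dec_and_test() and kfree() are the standard kernel APIs, used here deliberately without any overflow check:

/* Sketch of the bug pattern only, not code from any real driver.  If an
 * attacker can drive the reference count high enough to wrap around,
 * a later put reaches zero "early" and frees the object while other
 * references are still live, giving a use-after-free. */
#include <linux/atomic.h>
#include <linux/slab.h>

struct session {
    atomic_t refcount;
    /* ... payload ... */
};

static void session_get(struct session *s)
{
    atomic_inc(&s->refcount);          /* no overflow detection: can wrap */
}

static void session_put(struct session *s)
{
    if (atomic_dec_and_test(&s->refcount))
        kfree(s);                      /* frees too early after a wrap */
}

Atomic wrap detection would trap the overflow in session_get() and turn it into an oops instead of a silent wrap.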

30 November 2016

Chris Lamb: Free software activities in November 2016

Here is my monthly update covering what I have been doing in the free software world (previous month):
Reproducible builds

Whilst anyone can inspect the source code of free software for malicious flaws, most software is distributed pre-compiled to end users. The motivation behind the Reproducible Builds effort is to permit verification that no flaws have been introduced either maliciously or accidentally during this compilation process by promising identical results are always generated from a given source, thus allowing multiple third-parties to come to a consensus on whether a build was compromised.

This month:

My work in the Reproducible Builds project was also covered in our weekly reports (#80, #81, #82 & #83).

Toolchain issues I submitted the following patches to fix reproducibility-related toolchain issues with Debian:

strip-nondeterminism

strip-nondeterminism is our tool to remove specific non-deterministic results from a completed build.


jenkins.debian.net

jenkins.debian.net runs our comprehensive testing framework.

  • buildinfo.debian.net has moved to SSL. (ac3b9e7)
  • Submit signing keys to keyservers after generation. (bdee6ff)
  • Various cosmetic changes, including
    • Prefer if X not in Y over if not X in Y. (bc23884)
    • No need for a dictionary; let's just use a set. (bf3fb6c)
    • Avoid DRY violation by using a for loop. (4125ec5)

I also submitted 9 patches to fix specific reproducibility issues in apktool, cairo-5c, lava-dispatcher, lava-server, node-rimraf, perlbrew, qsynth, tunnelx & zp.

Debian

Debian LTS This month I have been paid to work 11 hours on Debian Long Term Support (LTS). In that time I did the following:
  • "Frontdesk" duties, triaging CVEs, etc.
  • Issued DLA 697-1 for bsdiff fixing an arbitrary write vulnerability.
  • Issued DLA 705-1 for python-imaging correcting a number of memory overflow issues.
  • Issued DLA 713-1 for sniffit where a buffer overflow allowed a specially-crafted configuration file to provide a root shell.
  • Issued DLA 723-1 for libsoap-lite-perl preventing a Billion Laughs XML expansion attack.
  • Issued DLA 724-1 for mcabber fixing a roster push attack.

Uploads
  • redis:
    • 3.2.5-2 Tighten permissions of /var/{lib,log}/redis. (#842987)
    • 3.2.5-3 & 3.2.5-4 Improve autopkgtest tests and install upstream's MANIFESTO and README.md documentation.
  • gunicorn (19.6.0-9) Adding autopkgtest tests.
  • libfiu:
    • 0.94-1 Add autopkgtest tests.
    • 0.95-1, 0.95-2 & 0.95-3 New upstream release and improve autopkgtest coverage.
  • python-django (1.10.3-1) New upstream release.
  • aptfs (0.8-3, 0.8-4 & 0.8-5) Adding and subsequently improving the autopkgtest tests.


I performed the following QA uploads:


Finally, I also made the following non-maintainer uploads:
  • libident (0.22-3.1) Move from obsolete Source-Version substvar to binary:Version. (#833195)
  • libpcl1 (1.6-1.1) Move from obsolete Source-Version substvar to binary:Version. (#833196)
  • pygopherd (2.0.18.4+nmu1) Move from obsolete Source-Version substvar to ${source:Version}. (#833202)


RC bugs


I also filed 59 FTBFS bugs against arc-gui-clients, asyncpg, blhc, civicrm, d-feet, dpdk, fbpanel, freeciv, freeplane, gant, golang-github-googleapis-gax-go, golang-github-googleapis-proto-client-go, haskell-cabal-install, haskell-fail, haskell-monadcatchio-transformers, hg-git, htsjdk, hyperscan, jasperreports, json-simple, keystone, koji, libapache-mod-musicindex, libcoap, libdr-tarantool-perl, libmath-bigint-gmp-perl, libpng1.6, link-grammar, lua-sql, mediatomb, mitmproxy, ncrack, net-tools, node-dateformat, node-fuzzaldrin-plus, node-nopt, open-infrastructure-system-images, open-infrastructure-system-images, photofloat, ppp, ptlib, python-mpop, python-mysqldb, python-passlib, python-protobix, python-ttystatus, redland, ros-message-generation, ruby-ethon, ruby-nokogiri, salt-formula-ceilometer, spykeviewer, sssd, suil, torus-trooper, trash-cli, twisted-web2, uftp & wide-dhcpv6.

FTP Team

As a Debian FTP assistant I ACCEPTed 70 packages: bbqsql, coz-profiler, cross-toolchain-base, cross-toolchain-base-ports, dgit-test-dummy, django-anymail, django-hstore, django-html-sanitizer, django-impersonate, django-wkhtmltopdf, gcc-6-cross, gcc-defaults, gnome-shell-extension-dashtodock, golang-defaults, golang-github-btcsuite-fastsha256, golang-github-dnephin-cobra, golang-github-docker-go-events, golang-github-gogits-cron, golang-github-opencontainers-image-spec, haskell-debian, kpmcore, libdancer-logger-syslog-perl, libmoox-buildargs-perl, libmoox-role-cloneset-perl, libreoffice, linux-firmware-raspi3, linux-latest, node-babel-runtime, node-big.js, node-buffer-shims, node-charm, node-cliui, node-core-js, node-cpr, node-difflet, node-doctrine, node-duplexer2, node-emojis-list, node-eslint-plugin-flowtype, node-everything.js, node-execa, node-grunt-contrib-coffee, node-grunt-contrib-concat, node-jquery-textcomplete, node-js-tokens, node-json5, node-jsonfile, node-marked-man, node-os-locale, node-sparkles, node-tap-parser, node-time-stamp, node-wrap-ansi, ooniprobe, policycoreutils, pybind11, pygresql, pysynphot, python-axolotl, python-drizzle, python-geoip2, python-mockupdb, python-pyforge, python-sentinels, python-waiting, pythonmagick, r-cran-isocodes, ruby-unicode-display-width, suricata & voctomix-outcasts. I additionally filed 4 RC bugs against packages that had incomplete debian/copyright files against node-cliui, node-core-js, node-cpr & node-grunt-contrib-concat.

7 November 2016

Reproducible builds folks: Reproducible Builds: week 80 in Stretch cycle

What happened in the Reproducible Builds effort between Sunday October 30 and Saturday November 5 2016: Upcoming events Reproducible work in other projects Bugs filed Reviews of unreproducible packages 81 package reviews have been added, 14 have been updated and 43 have been removed in this week, adding to our knowledge about identified issues. 3 issue types have been updated: 1 issue type has been removed: 1 issue type has been updated: Weekly QA work During the reproducibility testing, some FTBFS bugs have been detected and reported by: diffoscope development buildinfo.debian.net development tests.reproducible-builds.org Reproducible Debian: Misc. Also with thanks to Profitbricks sponsoring the "hardware" resources, Holger created a 13 core machine with 24GB RAM and 100GB SSD based storage so that Ximin can do further tests and development on GCC and other software on a fast machine. This week's edition was written by Chris Lamb, Ximin Luo, Vagrant Cascadian, Holger Levsen and reviewed by a bunch of Reproducible Builds folks on IRC.

31 October 2016

Antoine Beaupré: My free software activities, October 2016

Debian Long Term Support (LTS) This is my 7th month working on Debian LTS, started by Raphael Hertzog at Freexian, after a long pause during the summer. I have worked on the following packages and CVEs: I have also helped review work on the following packages:
  • imagemagick: reviewed BenH's work to figure out what was done. unfortunately, I forgot to officially take on the package and Roberto started working on it in the meantime. I nevertheless took time to review Roberto's work and outline possible issues with the original patchset suggested
  • tiff: reviewed Raphael's work on the hairy TIFFTAG_* issues, all the gory details in this email
The work on ImageMagick and GraphicsMagick was particularly intriguing. Looking at the source of those programs makes me wonder why we are still using them at all: it's a tangled mess of C code that is bound to bring up more and more vulnerabilities, time after time. It seems there's always a "Magick" vulnerability waiting to be fixed out there... I somehow hoped that the fork would bring more stability and reliability, but it seems they are suffering from similar issues because, fundamentally, they haven't rewritten ImageMagick... It looks like this is something that affects all image programs. The review I have done on the tiff suite gives me the same shivering sensation as reviewing the "Magick" code. It feels like all image libraries are poorly implemented and then bound to be exploited somehow... Nevertheless, if I had to use a library of the sort in my software, I would stay away from the "Magick" forks and try something like imlib2 first... Finally, I also did some minor work on the user and developer LTS documentation and some triage work on samba, xen and libass. I also looked at the dreaded CVE-2016-7117 vulnerability in the Linux kernel to verify its impact on wheezy users. I also looked at implementing a --lts flag for dch (see bug #762715). It was difficult to get back to work after such a long pause, but I am happy I was able to contribute a significant number of hours. It's a bit difficult to find work sometimes in LTS-land, even if there's actually always a lot of work to be done. For example, I used to be one of the people doing frontdesk work, but those duties are now assigned until the end of the year, so it's unlikely I will be doing any of that for the foreseeable future. Similarly, a lot of packages were assigned when I started looking at the available packages. There was an interesting discussion on the internal mailing list regarding unlocking package ownership, because some people had packages locked for weeks, sometimes months, without significant activity. Hopefully that situation will improve after that discussion. Another interesting discussion I participated in is the question of whether the LTS team should be waiting for unstable to be fixed before publishing fixes in oldstable. It seems the consensus right now is that it shouldn't be mandatory to fix issues in unstable before we fix security issues in oldstable and stable. After all, security support for testing and unstable is limited. But I was happy to learn that working on brand new patches is part of our mandate as part of the LTS work. I did work on such a patch for tar which ended up being adopted by the original reporter, although upstream ended up implementing our recommendation in a better way. It's coincidentally the first time since I started working on LTS that I didn't get the number of requested hours, which means that there are more people working on LTS. That is a good thing, but I am worried it may also mean people are more spread out and less capable of focusing for longer periods of time on more difficult problems. It also means that the team is growing faster than the funding, which is unfortunate: now is as good a time as any to remind you to see if you can make your company fund the LTS project if you are still running Debian wheezy.

Other free software work It seems like forever since I did such a report, and while I was on vacation, a lot has happened since the last one.

Monkeysign I have done extensive work on Monkeysign, trying to bring it kicking and screaming into the new world of GnuPG 2.1. This was the objective of the 2.1 release, which collected about two years of work and patches, including arbitrary MUA support (e.g. Thunderbird), config files support, and a release on PyPI. I have had to release about 4 more releases to try and fix the build chain, ship the test suite with the program and have a primitive preferences panel. The 2.2 release also finally features Tor support! I am also happy to have moved more documentation to Read the docs, part of which I mentioned in a previous article. The git repositories and issues were also moved to a Gitlab instance which will hopefully improve the collaboration workflow, although we still have issues in streamlining the merge request workflow. All in all, I am happy to be working on Monkeysign, but it has been a frustrating experience. In the last years, I have been maintaining the project largely on my own: although there are about 20 contributors in Monkeysign, I have committed over 90% of the commits in the code. New contributors recently showed up, and I hope this will release some pressure on me being the sole maintainer, but I am not sure how viable the project is.

Funding free software work More and more, I wonder how to sustain my contributions to free software. As a previous article has shown, I work a lot on the computer, even when I am not on a full-time job. Monkeysign has been a significant time drain in the last months, and I have done this work on a completely volunteer basis. I wouldn't mind so much except that there is a lot of work I do on a volunteer basis. This means that I sometimes must prioritize paid consulting work, at the expense of those volunteer projects. While most of my paid work usually revolves around free software, the benefits of paid work are not always immediately obvious, as the primary objective is to deliver to the customer, and the community as a whole is somewhat of a side-effect. I have watched with interest joeyh's adventures into crowdfunding which seems to be working pretty well for him. Unfortunately, I cannot claim the incredible (and well-deserved) reputation Joey has, and even if I could, I can't live with 500$ a month. I would love to hear if people would be interested in funding my work in such a way. I am hesitant in launching a crowdfunding campaign because it is difficult to identify what exactly I am working on from one month to the next. Looking back at earlier reports shows that I am all over the place: one month I'll work on a Perl Wiki (Ikiwiki), the next one I'll be hacking at a multimedia home cinema (Kodi). I can hardly think of how to fund those things short of "just give me money to work on anything I feel like", which I can hardly ask of anyone. Even worse, it feels like the audience here is either friends or colleagues. It would make little sense for me to seek funding from those people: colleagues have the same funding problems I do, and I don't want to impoverish my friends... So far I have taken the approach of trying to get funding for work I am doing, bit by bit. For example, I have recently been told that LWN actually pays for contributed articles and have started running articles by them before publishing them here. This is looking good: they will publish an article I wrote about the Omnia router I have recently received. I give them exclusive rights on the article for two weeks, but I otherwise retain full ownership over the article and will publish it here after the exclusive period. Hopefully, I will be able to find more such projects that pay for the work I do on a day to day basis.

Open Street Map editing I have ramped up my OpenStreetMap contributions, having (temporarily) moved to a different location. There are lots of things to map here: trails, gas stations and lots of other things are missing from the map. Sometimes the effort looks a bit ridiculous, reminding me of my early days of editing OSM. I have registered with OSM Live, a project to fund OSM editors that, I must admit, doesn't help much with funding my work: with the hundreds of edits I did in October, I received the equivalent of 1.80$CAD in Bitcoins. This may be the lowest hourly salary I have ever received, probably going at a rate of 10 per hour! Still, it's interesting to be able to point people to the project if someone wants to contribute to OSM mappers. But mappers should have no illusions about getting a decent salary from this effort, I am sorry to say.

Bounties I feel this is similar to the "bounty" model used by the Borg project: I claimed around $80USD in that project for what probably amounts to tens of hours of work, yet another salary that would qualify as "poor". Another example is a feature I would like to implement in Borg: support for protocols other than SSH. There is currently no bounty on this, but a similar feature, S3 support, has one of the largest bounties Borg has ever seen: $225USD. And the claimant for the bounty hasn't actually implemented the feature: instead of backing up to S3, the patch (to a third-party tool) actually enables support for Amazon Cloud Drive, a completely different API. Even at $225, I wouldn't be able to complete any of those features and get a decent salary. As well explained by the Snowdrift reviews, bounties just don't work at all... The ludicrous 10% fee charged by Bountysource made sure I would never do business with them ever again anyways.

Other work There are probably more things I did recently, but I am having difficulty keeping track of the last 5 months of on and off work, so you will forgive that I am not as exhaustive as I usually am.

1 October 2016

Kees Cook: security things in Linux v4.6

Previously: v4.5. The v4.6 Linux kernel release included a bunch of stuff, with much more of it under the KSPP umbrella. seccomp support for parisc Helge Deller added seccomp support for parisc, which including plumbing support for PTRACE_GETREGSET to get the self-tests working. x86 32-bit mmap ASLR vs unlimited stack fixed Hector Marco-Gisbert removed a long-standing limitation to mmap ASLR on 32-bit x86, where setting an unlimited stack (e.g. ulimit -s unlimited ) would turn off mmap ASLR (which provided a way to bypass ASLR when executing setuid processes). Given that ASLR entropy can now be controlled directly (see the v4.5 post), and that the cases where this created an actual problem are very rare, means that if a system sees collisions between unlimited stack and mmap ASLR, they can just adjust the 32-bit ASLR entropy instead. x86 execute-only memory Dave Hansen added Protection Key support for future x86 CPUs and, as part of this, implemented support for execute only memory in user-space. On pkeys-supporting CPUs, using mmap(..., PROT_EXEC) (i.e. without PROT_READ) will mean that the memory can be executed but cannot be read (or written). This provides some mitigation against automated ROP gadget finding where an executable is read out of memory to find places that can be used to build a malicious execution path. Using this will require changing some linker behavior (to avoid putting data in executable areas), but seems to otherwise Just Work. I m looking forward to either emulated QEmu support or access to one of these fancy CPUs. CONFIG_DEBUG_RODATA enabled by default on arm and arm64, and mandatory on x86 Ard Biesheuvel (arm64) and I (arm) made the poorly-named CONFIG_DEBUG_RODATA enabled by default. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so making it on-by-default is required to start any kind of attack surface reduction within the kernel. On x86 CONFIG_DEBUG_RODATA was already enabled by default, but, at Ingo Molnar s suggestion, I made it mandatory: CONFIG_DEBUG_RODATA cannot be turned off on x86. I expect we ll get there with arm and arm64 too, but the protection is still somewhat new on these architectures, so it s reasonable to continue to leave an out for developers that find themselves tripping over it. arm64 KASLR text base offset Ard Biesheuvel reworked a ton of arm64 infrastructure to support kernel relocation and, building on that, Kernel Address Space Layout Randomization of the kernel text base offset (and module base offset). As with x86 text base KASLR, this is a probabilistic defense that raises the bar for kernel attacks where finding the KASLR offset must be added to the chain of exploits used for a successful attack. One big difference from x86 is that the entropy for the KASLR must come either from Device Tree (in the /chosen/kaslr-seed property) or from UEFI (via EFI_RNG_PROTOCOL), so if you re building arm64 devices, make sure you have a strong source of early-boot entropy that you can expose through your boot-firmware or boot-loader. zero-poison after free Laura Abbott reworked a bunch of the kernel memory management debugging code to add zeroing of freed memory, similar to PaX/Grsecurity s PAX_MEMORY_SANITIZE feature. 
This feature means that memory is cleared at free, wiping any sensitive data so it doesn t have an opportunity to leak in various ways (e.g. accidentally uninitialized structures or padding), and that certain types of use-after-free flaws cannot be exploited since the memory has been wiped. To take things even a step further, the poisoning can be verified at allocation time to make sure that nothing wrote to it between free and allocation (called sanity checking ), which can catch another small subset of flaws. To understand the pieces of this, it s worth describing that the kernel s higher level allocator, the page allocator (e.g. __get_free_pages()) is used by the finer-grained slab allocator (e.g. kmem_cache_alloc(), kmalloc()). Poisoning is handled separately in both allocators. The zero-poisoning happens at the page allocator level. Since the slab allocators tend to do their own allocation/freeing, their poisoning happens separately (since on slab free nothing has been freed up to the page allocator). Only limited performance tuning has been done, so the penalty is rather high at the moment, at about 9% when doing a kernel build workload. Future work will include some exclusion of frequently-freed caches (similar to PAX_MEMORY_SANITIZE), and making the options entirely CONFIG controlled (right now both CONFIGs are needed to build in the code, and a kernel command line is needed to activate it). Performing the sanity checking (mentioned above) adds another roughly 3% penalty. In the general case (and once the performance of the poisoning is improved), the security value of the sanity checking isn t worth the performance trade-off. Tests for the features can be found in lkdtm as READ_AFTER_FREE and READ_BUDDY_AFTER_FREE. If you re feeling especially paranoid and have enabled sanity-checking, WRITE_AFTER_FREE and WRITE_BUDDY_AFTER_FREE can test these as well. To perform zero-poisoning of page allocations and (currently non-zero) poisoning of slab allocations, build with:
CONFIG_DEBUG_PAGEALLOC=n
CONFIG_PAGE_POISONING=y
CONFIG_PAGE_POISONING_NO_SANITY=y
CONFIG_PAGE_POISONING_ZERO=y
CONFIG_SLUB_DEBUG=y
and enable the page allocator poisoning and slab allocator poisoning at boot with this on the kernel command line:
page_poison=on slub_debug=P
To add sanity-checking, change PAGE_POISONING_NO_SANITY=n, and add F to slub_debug as "slub_debug=PF". read-only after init I added the infrastructure to support making certain kernel memory read-only after kernel initialization (inspired by a small part of PaX/Grsecurity's KERNEXEC functionality). The goal is to continue to reduce the attack surface within the kernel by making even more of the memory, especially function pointer tables, read-only (which depends on CONFIG_DEBUG_RODATA above). Function pointer tables (and similar structures) are frequently targeted by attackers when redirecting execution. While many are already declared const in the kernel source code, making them read-only (and therefore unavailable to attackers) for their entire lifetime, there is a class of variables that get initialized during kernel (and module) start-up (i.e. written to during functions that are marked "__init") and then never (intentionally) written to again. Some examples are things like the VDSO, vector tables, arch-specific callbacks, etc. As it turns out, most architectures with kernel memory protection already delay making their data read-only until after __init (see mark_rodata_ro()), so it's trivial to declare a new data section (".data..ro_after_init") and add it to the existing read-only data section (".rodata"). Kernel structures can be annotated with the new section (via the __ro_after_init macro), and they'll become read-only once boot has finished. The next step for attack surface reduction infrastructure will be to create a kernel memory region that is passively read-only, but can be made temporarily writable (by a single un-preemptable CPU), for storing sensitive structures that are written to only very rarely. Once this is done, much more of the kernel's attack surface can be made read-only for the majority of its lifetime. As people identify places where __ro_after_init can be used, we can grow the protection. A good place to start is to look through the PaX/Grsecurity patch to find uses of __read_only on variables that are only written to during __init functions. The rest are places that will need the temporarily-writable infrastructure (PaX/Grsecurity uses pax_open_kernel()/pax_close_kernel() for these). That's it for v4.6, next up will be v4.7!
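As a small illustration of the __ro_after_init usage pattern described above, here is a hedged sketch; the ops structure and the functions are invented, while __ro_after_init, __init and early_initcall() are the real kernel facilities being discussed:

/* Illustrative sketch: frobnicator_ops and its functions are made up.
 * The point is the __ro_after_init annotation, which places active_ops
 * in the .data..ro_after_init section so it is writable during boot and
 * read-only afterwards. */
#include <linux/cache.h>   /* __ro_after_init */
#include <linux/init.h>    /* __init, early_initcall */

struct frobnicator_ops {
    int (*probe)(void);
    void (*handle)(unsigned long event);
};

/* Written exactly once, from an __init function, then never again. */
static struct frobnicator_ops active_ops __ro_after_init;

static int frob_probe(void) { return 0; }
static void frob_handle(unsigned long event) { (void)event; }

static int __init frobnicator_init(void)
{
    /* Last legitimate write: once boot finishes and mark_rodata_ro()
     * runs, the page backing active_ops is mapped read-only. */
    active_ops.probe  = frob_probe;
    active_ops.handle = frob_handle;
    return 0;
}
early_initcall(frobnicator_init);

Once mark_rodata_ro() has run at the end of boot, any later write to active_ops faults, which is exactly the property described above.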

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

16 August 2016

Lars Wirzenius: 20 years ago I became a Debian developer

Today it is 23 years since Ian Murdock published his intention to develop a new Linux distribution, Debian. It is also about 20 years since I became a Debian developer and made my first package upload. In the time since: It's been a good twenty years. And the fun ain't over yet.

8 June 2016

Lucas Nussbaum: Re: Sysadmin Skills and University Degrees

Russell Coker wrote about Sysadmin Skills and University Degrees. I couldn't agree more that a major deficiency in Computer Science degrees is the lack of sysadmin training. It seems like most sysadmins learned most of what they know from experience. It's very hard to recruit young engineers (freshly out of university) for sysadmin jobs, and the job interviews are often a bit depressing. Sysadmin jobs are also not very popular with this public, probably because university curriculums fail to emphasize what's exciting about those jobs. However, I think I disagree rather deeply with Russell's detailed analysis. First, Version Control. Well, I think that it's pretty well covered in university curriculums nowadays. From my point of view, teaching CS in Université de Lorraine (France), mostly in Licence Professionnelle Administration de Systèmes, Réseaux et Applications à base de Logiciels Libres (warning: French), a BSc degree focusing on Linux systems administration, it's not unusual to see student projects with a mandatory use of Git. And it doesn't seem to be a major problem for students (which always surprises me). However, I wouldn't rate Version Control as the most important thing that is required for a sysadmin. Similarly, Dependencies and Backups are things that should be covered, but probably not as first-class citizens. I think that there are several pillars in the typical sysadmin knowledge. First and foremost, sysadmins need a good understanding of the inner workings of an operating system. I sometimes feel that many Operating Systems Design courses are a bit too much focused on the Design side of things. Yes, it's useful to understand the low-level mechanisms, and be able to (mentally) recreate an OS from scratch. But it's also interesting to know how real systems are actually built, and what are the trade-offs involved. I very much enjoyed reading Brendan Gregg's Systems Performance: Enterprise and the Cloud because each chapter starts with a great overview of how things are in the real world, with a very good level of detail. Also, addressing OS design from the point of view of performance could be a way to turn those courses into something more attractive for students: many people like to measure, benchmark, and optimize things, and it's quite easy to demonstrate how different designs, or different configurations, make a big difference in terms of performance in the context of OS design. It's possible to be a sysadmin and ignore, say, the existence of the VFS, but there's a large class of problems that you will never be able to solve. It can be a good trade-off for a curriculum (e.g. at the BSc level) to decide to ignore most of the low-level stuff, but it's important to be aware of it. Students also need to learn how to design a proper infrastructure (that meets requirements in terms of scalability, availability, security, and maybe elasticity). Yes, backups are important. But monitoring is, too. As well as high availability. In order to scale, it's important to be able to automate stuff. Russell writes that sysadmins need some programming skills, but that's mostly scripting and basic debugging. Well, when you design an infrastructure, or when you use configuration management tools such as Puppet, in some sense, you are programming, and in terms of the need to abstract things, it's actually similar to doing object-oriented programming, with similar choices (should I use that off-the-shelf Puppet module, or re-develop my own? How should everything fit together?).
Also, when debugging, it's often useful to be able to dig into code, understand what the developer was trying to do, and check whether the expected behavior actually matches what you are seeing. It often results in spending a lot of time to create a one-line fix, and it requires very advanced programming skills. Again, it's possible to be a sysadmin with only limited software development knowledge, but there's a large class of things that you are unlikely to address properly. I think that what makes sysadmin jobs both very interesting and very challenging is that they require a very wide range of knowledge. There's often the opportunity to learn about new stuff (much more than in software development jobs). Of course, the difficult question is where to draw the line. What is the sysadmin knowledge that every CS graduate should have, even in curriculums not targeting sysadmin jobs? What is the sysadmin knowledge for a sysadmin BSc degree? For a sysadmin MSc degree?

6 June 2016

Reproducible builds folks: Reprotest has a preliminary CLI and configuration file handling

Author: ceridwen This is the first draft of reprotest's interface, and I welcome comments on how to improve it. At the moment, reprotest's CLI takes two mandatory arguments, the build command to run and the build artifact file to test after running the build. If the build command or build artifact have spaces, they have to be passed as strings, e.g. "debuild -b -uc -us". For optional arguments, it has --variations, which accepts a list of possible build variations to test, one or more of 'captures_environment', 'domain_host', 'filesystem', 'home', 'kernel', 'locales', 'path', 'shell', 'time', 'timezone', 'umask', and 'user_group' (see variations for more information); --dont_vary, which makes reprotest not test any variations in the given list (the default is to run all variations); --source_root, which accepts a directory to run the build command in and defaults to the current working directory; and --verbose, which will eventually enable more detailed logging. To get help for the CLI, run reprotest -h or reprotest --help. The config file has one section, basics, and the same options as the CLI, except there's no dont_vary option, and there are build_command and artifact options. If build_command and/or artifact are set in the config file, reprotest can be run without passing those as command-line arguments. Command-line arguments always override config file options. Reprotest currently searches the working directory for the config file, but it will also eventually search the user's home directory. A sample config file is below.
[basics]
build_command = setup.py sdist
artifact = dist/reprotest-0.1.tar.gz
source_root = reprotest/
variations =
  captures_environment
  domain_host
  filesystem
  home
  host
  kernel
  locales
  path
  shell
  time
  timezone
  umask
  user_group
At the moment, the only build variations that reprotest actually tests are the environment variable variations: captures_environment, home, locales, and timezone. Over the next week, I plan to add the rest of the basic variations and accompanying tests. I also need to write tests for the CLI and the configuration file. After that, I intend to work on getting (s)chroot communication working, which will involve integrating autopkgtest code. Some of the variations require specific other packages to be installed: for instance, the locales variation currently requires the fr_CH.UTF-8 locale. Locales are a particular problem because I don't know of a way in Debian to specify that a given locale must be installed. For other packages, it's unclear to me whether I should specify them as depends or recommends: they aren't dependencies in a strict sense, but marking them as dependencies will make it easier to install a fully-functional reprotest. When reprotest runs with variations enabled that it can't test because it doesn't have the correct packages installed, I intend to have it print a warning but continue to run. tests.reproducible-builds.org also has different settings, such as different locales, for different architectures. I'm not clear on why this is. I'd prefer to avoid having to generate a giant list of variations based on architecture, but if necessary, I can do that. The prebuilder script contains variations specific to Linux, to Debian, and to pbuilder/cowbuilder. I'm not including Debian-specific variations until I get much more of the basic functionality implemented, and I'm not sure I'm going to include pbuilder-specific variations ever, because it's probably better for extensibility to other OSes, e.g. BSD, to add support for plugins or more complicated configurations. I implemented the variations by creating a function for each variation. Each function takes as input two build commands, two source trees, and two sets of environment variables and returns the same. At the moment, I'm using dictionaries for the environment variables, mutating them in-place and passing the references forward. I'm probably going to replace those at some point with an immutable mapping. While at the moment, reprotest only builds on the existing system, when I start extending it to other build environments, this will require double-dispatch, because the code that needs to be executed will depend on both the variation to be tested and the environment being built on. At the moment, I'm probably going to implement this with a dictionary with tuple keys of (build_environment, variation) or nested dictionaries. If it's necessary for code to depend on OS or architecture, too, this could end up becoming a triple or quadruple dispatch.

24 May 2016

Alberto García: I/O bursts with QEMU 2.6

QEMU 2.6 was released a few days ago. One new feature that I have been working on is the new way to configure I/O limits in disk drives to allow bursts and increase the responsiveness of the virtual machine. In this post I'll try to explain how it works. The basic settings First I will summarize the basic settings that were already available in earlier versions of QEMU. Two aspects of the disk I/O can be limited: the number of bytes per second and the number of operations per second (IOPS). For each one of them the user can set a global limit or separate limits for read and write operations. This gives us a total of six different parameters. I/O limits can be set using the throttling.* parameters of -drive, or using the QMP block_set_io_throttle command. These are the names of the parameters for both cases:
-drive                    block_set_io_throttle
throttling.iops-total     iops
throttling.iops-read      iops_rd
throttling.iops-write     iops_wr
throttling.bps-total      bps
throttling.bps-read       bps_rd
throttling.bps-write      bps_wr
It is possible to set limits for both IOPS and bps at the same time, and for each case we can decide whether to have separate read and write limits or not, but if iops-total is set then neither iops-read nor iops-write can be set. The same applies to bps-total and bps-read/write. The default value of these parameters is 0, and it means unlimited. In its most basic usage, the user can add a drive to QEMU with a limit of, say, 100 IOPS with the following -drive line:
-drive file=hd0.qcow2,throttling.iops-total=100
We can do the same using QMP. In this case all these parameters are mandatory, so we must set to 0 the ones that we don't want to limit:
     {
       "execute": "block_set_io_throttle",
       "arguments": {
          "device": "virtio0",
          "iops": 100,
          "iops_rd": 0,
          "iops_wr": 0,
          "bps": 0,
          "bps_rd": 0,
          "bps_wr": 0
       }
     }
I/O bursts While the settings that we have just seen are enough to prevent the virtual machine from performing too much I/O, it can be useful to allow the user to exceed those limits occasionally. This way we can have a more responsive VM that is able to cope better with peaks of activity while keeping the average limits lower the rest of the time. Starting from QEMU 2.6, it is possible to allow the user to do bursts of I/O for a configurable amount of time. A burst is an amount of I/O that can exceed the basic limit, and there are two parameters that control them: their length and the maximum amount of I/O they allow. These two can be configured separately for each one of the six basic parameters described in the previous section, but here we'll use iops-total as an example. The I/O limit during bursts is set using iops-total-max, and the maximum length (in seconds) is set with iops-total-max-length. So if we want to configure a drive with a basic limit of 100 IOPS and allow bursts of 2000 IOPS for 60 seconds, we would do it like this (the line is split for clarity):
   -drive file=hd0.qcow2,
          throttling.iops-total=100,
          throttling.iops-total-max=2000,
          throttling.iops-total-max-length=60
Or with QMP:
     {
       "execute": "block_set_io_throttle",
       "arguments": {
          "device": "virtio0",
          "iops": 100,
          "iops_rd": 0,
          "iops_wr": 0,
          "bps": 0,
          "bps_rd": 0,
          "bps_wr": 0,
          "iops_max": 2000,
          "iops_max_length": 60
       }
     }
With this, the user can perform I/O on hd0.qcow2 at a rate of 2000 IOPS for 1 minute before it's throttled down to 100 IOPS. The user will be able to do bursts again if there's a sufficiently long period of time with unused I/O (see below for details). The default value for iops-total-max is 0 and it means that bursts are not allowed. iops-total-max-length can only be set if iops-total-max is set as well, and its default value is 1 second. Controlling the size of I/O operations When applying IOPS limits all I/O operations are treated equally regardless of their size. This means that the user can take advantage of this in order to circumvent the limits and submit one huge I/O request instead of several smaller ones. QEMU provides a setting called throttling.iops-size to prevent this from happening. This setting specifies the size (in bytes) of an I/O request for accounting purposes. Larger requests will be counted proportionally to this size. For example, if iops-size is set to 4096 then an 8KB request will be counted as two, and a 6KB request will be counted as one and a half. This only applies to requests larger than iops-size: smaller requests will always be counted as one, no matter their size. The default value of iops-size is 0 and it means that the size of the requests is never taken into account when applying IOPS limits. Applying I/O limits to groups of disks In all the examples so far we have seen how to apply limits to the I/O performed on individual drives, but QEMU allows grouping drives so they all share the same limits. This feature is available since QEMU 2.4. Please refer to the post I wrote when it was published for more details. The Leaky Bucket algorithm I/O limits in QEMU are implemented using the leaky bucket algorithm (specifically the "leaky bucket as a meter" variant). This algorithm uses the analogy of a bucket that leaks water constantly. The water that gets into the bucket represents the I/O that has been performed, and no more I/O is allowed once the bucket is full. To see the way this corresponds to the throttling parameters in QEMU, consider the following values:
  iops-total=100
  iops-total-max=2000
  iops-total-max-length=60
The bucket is initially empty, therefore water can be added until it's full at a rate of 2000 IOPS (the burst rate). Once the bucket is full we can only add as much water as it leaks, therefore the I/O rate is reduced to 100 IOPS. If we add less water than it leaks then the bucket will start to empty, allowing for bursts again. Note that since water is leaking from the bucket even during bursts, it will take a bit more than 60 seconds at 2000 IOPS to fill it up. After those 60 seconds the bucket will have leaked 60 x 100 = 6000, allowing for 3 more seconds of I/O at 2000 IOPS. Also, due to the way the algorithm works, longer bursts can be done at a lower I/O rate, e.g. 1000 IOPS during 120 seconds (a small standalone model of this accounting follows at the end of this post). Acknowledgments As usual, my work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the QEMU development team. Enjoy QEMU 2.6!
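As mentioned above, here is a small standalone model of this accounting. It is not QEMU code: the names, and the assumption that the bucket capacity equals iops-total-max × iops-total-max-length, are illustrative choices that reproduce the 60 x 100 = 6000 arithmetic from the example; the iops-size proportional accounting is folded in as well.

/* Toy model of "leaky bucket as a meter" throttling, mirroring the
 * arithmetic in the post above (iops-total=100, iops-total-max=2000,
 * iops-total-max-length=60).  Not QEMU's implementation. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    double leak_rate;   /* iops-total: sustained ops per second           */
    double capacity;    /* assumed: iops-total-max * iops-total-max-length */
    double iops_size;   /* throttling.iops-size, 0 = ignore request size   */
    double level;       /* current bucket content, in "operations"         */
} ThrottleBucket;

/* Cost of one request: 1 up to iops-size, proportional above it. */
static double request_cost(const ThrottleBucket *b, double bytes)
{
    if (b->iops_size <= 0 || bytes <= b->iops_size)
        return 1.0;
    return bytes / b->iops_size;
}

/* Let the bucket leak for dt seconds. */
static void bucket_leak(ThrottleBucket *b, double dt)
{
    b->level -= b->leak_rate * dt;
    if (b->level < 0)
        b->level = 0;
}

/* Try to account one request; returns false if it must be throttled. */
static bool bucket_admit(ThrottleBucket *b, double bytes)
{
    double cost = request_cost(b, bytes);
    if (b->level + cost > b->capacity)
        return false;              /* bucket full: caller must wait */
    b->level += cost;
    return true;
}

int main(void)
{
    ThrottleBucket b = { .leak_rate = 100, .capacity = 2000.0 * 60, .iops_size = 0 };
    int admitted = 0;

    /* Offer 2000 requests per second and count how many get through. */
    for (int second = 0; second < 70; second++) {
        int ok = 0;
        for (int i = 0; i < 2000; i++) {
            bucket_leak(&b, 1.0 / 2000);   /* spread the leak over the second */
            if (bucket_admit(&b, 4096))
                ok++;
        }
        admitted += ok;
        if (second % 10 == 0 || ok < 2000)
            printf("second %2d: %4d ops admitted (bucket at %.0f/%.0f)\n",
                   second, ok, b.level, b.capacity);
    }
    printf("total admitted in 70s: %d\n", admitted);
    return 0;
}

Compiled with any C99 compiler, it admits roughly 2000 operations per second for the first 63 seconds or so and then settles at about 100 per second, matching the description above.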
